STAT515 Final Project

1. Introduction

Depression is a mental disorder that affects millions of people around the world. Understanding its causes and risk factors is an important public health task, because depression can have severe consequences for an individual: it reduces quality of life and increases the risk of suicide. For this reason, we chose to investigate depression and the factors associated with it.

2. Background

As the prevalence of depression and anxiety disorders rises globally, public health systems face growing challenges. As shown in the figure, the prevalence of anxiety and depression varies significantly from country to country, which suggests a need for treatment strategies tailored to each region. Depression and anxiety disorders call for urgent global attention, and action is needed to reduce their health burden.

This visualization is an interactive map built with Shiny. To accommodate colorblind viewers, the background color can be switched at will, so that different viewers can read the information in the graphic.


This image is a world map showing the global prevalence of anxiety and depression. The different colors on the map represent the prevalence of anxiety and depression in different countries, ranging from 0% to 50%. For example, the dark red color represents a prevalence of 40% to 50%, while the light yellow color indicates a prevalence of 0% to 10%. As can be seen from the figure, anxiety and depression are unevenly distributed globally, with some countries having significantly higher prevalence rates than others.

3. Depression Dataset

3.1.1 Introduction of Dataset

The depression dataset is taken from https://ourworldindata.org/ and includes 1147 individual records and 36 attributes, such as sex, age, marital status, number of children in the household, household size, years of education, consumption of nondurable goods, value of durable assets, value of cell phone assets, savings assets, total owned land assets, total food consumption, alcohol consumption, tobacco consumption, consumption of medical care, consumption of children’s medical care, consumption of education, consumption of social activities, other consumption, nonagricultural income, flow cost of nonagricultural business, total cost, frequency of purchasing full-price food items on a regular basis, how often children buy full-price food, meat food consumption, whether the diet is adequate, frequency of sleep deprivation due to hunger, number of days household members were sick, number of deaths of children under five years old, expenses on education, school attendance rate, investment in durable goods, investment in nondurable goods, and depressed status. The dataset has no missing values, but it is imbalanced: 953 records have a depressed value of 0 and 194 records have a depressed value of 1.
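
As a quick check of the class balance described above, the following sketch rebuilds a label vector from the two reported counts and tabulates it. The vector name `depressed` is an assumption made for illustration; with the real data, the dataset's own outcome column would be tabulated instead.

```r
# Rebuild the outcome labels from the counts reported in the text
# (hypothetical vector; the real data would supply this column).
depressed <- rep(c(0, 1), times = c(953, 194))

class_counts <- table(depressed)          # absolute counts per class
class_share  <- prop.table(class_counts)  # relative frequencies

print(class_counts)
print(round(class_share, 3))

# Ratio of majority to minority class, a simple imbalance measure
imbalance_ratio <- max(class_counts) / min(class_counts)
print(round(imbalance_ratio, 2))
```

About 17% of the records are labeled as depressed, which is roughly one depressed record for every five non-depressed ones.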

3.1.2 Structure of dataset


3.2 Explore the data

3.2.1 Box plot and Bar plot.

[Interactive Shiny box plot and bar plot; not available in this static document]

3.2.2 HeatMap

This heat map shows the correlations between variables, with the shade of each cell indicating the strength of the correlation. Many of the darker cells indicate weak correlations between the corresponding variables. Based on this, we determined the direction of our research:

1. Low correlation between some of the features: this limits the predictive performance of a linear model. Given that some variables are categorical, we will use logistic regression and decision tree models.

2. Class imbalance: the depressed class has particularly few records in this dataset, which makes it difficult for a model to learn valid patterns, and the correlation matrix may therefore show weak correlations with the outcome. (We demonstrate this later in the model, where most depressed records are misclassified as non-depressed.) In this case, techniques such as SMOTE may improve model performance by balancing the classes so that the model can better learn the minority class.

3. The need for data preprocessing: there are no NULL values in this dataset, but a large number of 0 values appear. We therefore need to investigate what these 0 values mean and handle them accordingly.
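
The first and third points can be checked with a few lines of base R. The sketch below uses simulated data, not the project's dataset: the data frame `toy` and its values are assumptions, and only the column names are borrowed from the dataset. It reports the share of exact zeros in each column and the correlation of each predictor with the outcome.

```r
set.seed(515)  # reproducible toy data; not the project's dataset

# Hypothetical stand-in with two zero-inflated spending columns
n <- 200
toy <- data.frame(
  cons_alcohol = ifelse(runif(n) < 0.7, 0, rexp(n, rate = 0.5)),
  cons_tobacco = ifelse(runif(n) < 0.8, 0, rexp(n, rate = 0.5)),
  cons_allfood = rexp(n, rate = 0.01),
  depression   = rbinom(n, size = 1, prob = 0.17)
)

# Share of exact zeros in each column
zero_share <- colMeans(toy == 0)
print(round(zero_share, 2))

# Pearson correlation of every predictor with the outcome
predictors <- setdiff(names(toy), "depression")
cor_with_outcome <- sapply(toy[predictors], cor, y = toy$depression)
print(round(cor_with_outcome, 3))
```

Columns with a very high zero share are the ones whose zeros most need an explanation (a true zero or an unrecorded value) before modeling.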

4. Data preprocessing

4.1 PCA

4.1.1 PCA application

## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.5159 1.76735 1.62361 1.56583 1.39430 1.34675 1.22116
## Proportion of Variance 0.1862 0.09187 0.07753 0.07211 0.05718 0.05335 0.04386
## Cumulative Proportion  0.1862 0.27804 0.35557 0.42769 0.48487 0.53821 0.58207
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.13455 1.06340 1.03218 1.02295 0.95435 0.94237 0.92951
## Proportion of Variance 0.03786 0.03326 0.03133 0.03078 0.02679 0.02612 0.02541
## Cumulative Proportion  0.61993 0.65319 0.68452 0.71530 0.74209 0.76821 0.79362
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.90396 0.8882 0.81877 0.80343 0.78846 0.75410 0.74131
## Proportion of Variance 0.02403 0.0232 0.01972 0.01899 0.01828 0.01673 0.01616
## Cumulative Proportion  0.81765 0.8409 0.86057 0.87956 0.89784 0.91457 0.93073
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.68170 0.64688 0.60510 0.54098 0.48887 0.45130 0.42257
## Proportion of Variance 0.01367 0.01231 0.01077 0.00861 0.00703 0.00599 0.00525
## Cumulative Proportion  0.94440 0.95671 0.96748 0.97608 0.98311 0.98910 0.99436
##                           PC29    PC30    PC31    PC32     PC33      PC34
## Standard deviation     0.35686 0.22364 0.10839 0.05292 0.001526 2.008e-08
## Proportion of Variance 0.00375 0.00147 0.00035 0.00008 0.000000 0.000e+00
## Cumulative Proportion  0.99810 0.99957 0.99992 1.00000 1.000000 1.000e+00
##  [1] 0.1861730 0.2780412 0.3555743 0.4276868 0.4848652 0.5382108 0.5820708
##  [8] 0.6199299 0.6531895 0.6845244 0.7153017 0.7420895 0.7682092 0.7936207
## [15] 0.8176543 0.8408564 0.8605737 0.8795590 0.8978435 0.9145692 0.9307322
## [22] 0.9444003 0.9567076 0.9674765 0.9760841 0.9831133 0.9891037 0.9943555
## [29] 0.9981010 0.9995720 0.9999176 0.9999999 1.0000000 1.0000000
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 418  92
##          1   5   1
##                                           
##                Accuracy : 0.812           
##                  95% CI : (0.7756, 0.8448)
##     No Information Rate : 0.8198          
##     P-Value [Acc > NIR] : 0.7             
##                                           
##                   Kappa : -0.0017         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.98818         
##             Specificity : 0.01075         
##          Pos Pred Value : 0.81961         
##          Neg Pred Value : 0.16667         
##              Prevalence : 0.81977         
##          Detection Rate : 0.81008         
##    Detection Prevalence : 0.98837         
##       Balanced Accuracy : 0.49947         
##                                           
##        'Positive' Class : 0               
## 
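
The printed output above can be cross-checked in base R. This sketch copies the cumulative-variance vector and the four confusion-matrix cells from the output and recomputes the headline figures. The variable names are assumptions for illustration, and the statistics follow the output's convention that class 0 is the positive class.

```r
# Cumulative proportion of variance, copied from the vector printed above
cum_var <- c(0.1861730, 0.2780412, 0.3555743, 0.4276868, 0.4848652,
             0.5382108, 0.5820708, 0.6199299, 0.6531895, 0.6845244,
             0.7153017, 0.7420895, 0.7682092, 0.7936207, 0.8176543,
             0.8408564, 0.8605737, 0.8795590, 0.8978435, 0.9145692,
             0.9307322, 0.9444003, 0.9567076, 0.9674765, 0.9760841,
             0.9831133, 0.9891037, 0.9943555, 0.9981010, 0.9995720,
             0.9999176, 0.9999999, 1.0000000, 1.0000000)

share_first_15 <- cum_var[15]               # variance kept by 15 components
n_for_95       <- which(cum_var >= 0.95)[1] # fewest components reaching 95%
print(share_first_15)
print(n_for_95)

# Confusion matrix from the output: rows = prediction, columns = reference
cm <- matrix(c(418, 5, 92, 1), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # non-depressed correctly found
specificity <- cm["1", "1"] / sum(cm[, "1"])  # depressed correctly found
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 5))
```

Because class 0 is the positive class here, the specificity is the share of truly depressed samples that the model recovered, and the balanced accuracy shows how little the model improves on guessing once both classes are weighted equally.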

4.1.2 PCA result analysis

1. Explanatory power of PCA: in the output above, the first 15 principal components explain about 81.8% of the cumulative variance, and roughly 25 components are needed to exceed 97%. Most of the information is therefore contained in the first 15 components, which, in theory, makes for a reasonable dimensionality reduction.

2. model performance (using the properties of PCA):

Confusion matrix: the model predicted almost all test samples as non-depressed. Only 6 samples were predicted as depressed, and just 1 of those predictions was correct, although the test set contains 93 depressed samples.

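Accuracy and specificity: the overall accuracy of the model was 81.2%, slightly below the no-information rate of 81.98%, and the specificity was only 0.011. Because class 0 (non-depressed) is treated as the positive class in this output, specificity measures how many truly depressed samples were identified, so the model found almost none of them. Equivalently, 92 depressed samples were wrongly assigned to the positive (non-depressed) class, which is a high rate of false positives.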

3. problem analysis:

Data imbalance: the proportions of depressed and non-depressed samples in the dataset are severely imbalanced, so the model is biased towards the majority class, resulting in high overall accuracy but very low recall for the depressed class.

Influence of the data: although PCA can reduce the dimensionality of the data, the correlations between the original variables are not high enough for PCA to capture information useful for prediction. In addition, the large number of zero values in many variables may affect the effectiveness of PCA, because these zeros may carry different meanings (e.g., not recorded or an actual value of zero), thus distorting the intrinsic distribution of the data. PCA also tends to emphasize variables with large variance. Some variables in this dataset have very high variance (e.g., agricultural income) while others have relatively low variance. Unless the variables are standardized first, the high-variance variables dominate the calculation of the principal components, so the components mainly reflect those variables and ignore others that may be just as important but have low variance.

4. Improve the methodology:

Dealing with the imbalanced data: as the visualizations show, the dataset is imbalanced, and our target variable, depression status, is especially so. We are considering SMOTE or other sampling techniques to balance the classes.

Feature engineering: further analyze the variables, especially those containing a large number of 0 values, to understand the specific meaning of these 0 values and consider whether these variables need special treatment, such as variable transformation, filling in missing values, etc.
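
To illustrate the balancing idea, here is a minimal sketch of random oversampling in base R. This is not SMOTE: SMOTE synthesizes new minority points by interpolating between neighbors, whereas the sketch simply resamples existing minority rows with replacement. The data frame `train` and its `age` column are hypothetical toy data that reuse the class counts reported earlier. In practice, resampling of this kind would normally be applied to the training split only.

```r
set.seed(515)  # reproducible toy example; not the project's data

# Hypothetical training set with the class counts reported for the dataset
train <- data.frame(
  age        = round(runif(1147, min = 18, max = 90)),
  depression = rep(c(0, 1), times = c(953, 194))
)

minority_rows <- which(train$depression == 1)
majority_rows <- which(train$depression == 0)

# Draw minority rows with replacement until both classes have equal size
extra_rows <- sample(minority_rows,
                     size = length(majority_rows) - length(minority_rows),
                     replace = TRUE)

balanced <- train[c(majority_rows, minority_rows, extra_rows), ]
print(table(balanced$depression))
```

Random oversampling duplicates information rather than adding any, so it can encourage overfitting to repeated minority rows. That is one reason an interpolating method such as SMOTE is often considered instead.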

4.2 Remove the outliers

In the preprocessing stage, handling outliers is important. Numerous techniques are available for this purpose; in our project, we opted for the Z-score method. Using the mean and standard deviation of each variable, this method flags values that lie more than three standard deviations from the mean, a range that covers about 99.7% of values under a normal distribution.
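
A minimal sketch of the Z-score rule follows, assuming a cutoff of three standard deviations (the cutoff that corresponds to the 99.7% figure). The vector `x` is simulated toy data with two injected extreme values; it is not a column from the project's dataset.

```r
set.seed(515)  # reproducible toy data; not the project's dataset

# Share of a normal distribution within 3 standard deviations of the mean
coverage <- pnorm(3) - pnorm(-3)
print(round(coverage, 4))

# Hypothetical numeric column with two injected extreme values
x <- c(rnorm(500, mean = 50, sd = 10), 150, -60)

# Z-score: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

# Flag values more than 3 standard deviations from the mean
is_outlier <- abs(z) > 3
x_clean <- x[!is_outlier]

print(sum(is_outlier))
print(range(x_clean))
```

This sketch drops the flagged values. Replacing them (for example, with a capped value) is a separate design choice that keeps the number of rows unchanged. Because the mean and standard deviation are themselves affected by extreme values, the rule is least reliable for heavily skewed or zero-inflated variables.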

4.2.1 cleaned_dataset

## 'data.frame':    1147 obs. of  35 variables:
##  $ sex                   : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ age                   : num  21 44 23 67 28 23 22 27 59 35 ...
##  $ marital_status        : int  0 1 1 0 1 1 1 1 0 1 ...
##  $ children              : int  3 6 1 0 4 3 3 2 4 6 ...
##  $ household_size        : int  4 8 3 1 6 5 5 4 6 8 ...
##  $ years_of_edu          : int  10 6 7 1 10 8 9 10 10 10 ...
##  $ hh_children           : int  3 6 1 0 0 0 0 2 4 6 ...
##  $ cons_nondurable       : num  358 233 172 37 0 ...
##  $ asset_durable         : num  208.2 11.7 120.9 32.8 0 ...
##  $ asset_phone           : num  40 0 56.1 0 0 ...
##  $ asset_savings         : num  0 12.8 0 0 0 ...
##  $ asset_land_owned_total: num  2 3 0 1.75 0 0 0 0 0.5 1.7 ...
##  $ cons_allfood          : num  231 211.9 81.6 25.7 0 ...
##  $ cons_alcohol          : num  0 1.17 0 0 0 ...
##  $ cons_tobacco          : num  0 0 0 0 0 ...
##  $ cons_med_total        : num  0 0 0 0 0 ...
##  $ cons_med_children     : num  0 0 0 0 1.22 ...
##  $ cons_ed               : num  0 1.521 0.721 0 0 ...
##  $ cons_social           : num  4 0 13.45 2.54 0 ...
##  $ cons_other            : num  122.84 19.97 76.15 8.73 0 ...
##  $ ent_nonag_revenue     : num  72.1 0 0 0 0 ...
##  $ ent_nonag_flowcost    : num  24 0 0 0 0 ...
##  $ ent_total_cost        : num  48.166 0.378 0 4.805 0 ...
##  $ fs_adwholed_often     : num  0 0 0 3 0 0 0 0 0 3 ...
##  $ fs_chwholed_often     : num  0 0 0 0 0.504 ...
##  $ fs_meat               : num  3 5 2 1 3.07 ...
##  $ fs_enoughtom          : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ fs_sleephun           : num  1 0 1 1 0 0 0 1 0 1 ...
##  $ med_sickdays_hhave    : num  1 2.75 2.67 3 1.44 ...
##  $ med_u5_deaths         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ed_expenses           : num  0 18.26 8.65 0 0 ...
##  $ ed_schoolattend       : num  0 0.8 1 0 0 ...
##  $ durable_investment    : num  569.9 252.7 141.7 58.3 0 ...
##  $ nondurable_investment : num  48.166 14.712 0.721 4.805 0 ...
##  $ depression            : num  0 0 0 1 0 1 0 0 0 0 ...
##                    sex                    age         marital_status 
##                   TRUE                   TRUE                   TRUE 
##               children         household_size           years_of_edu 
##                   TRUE                   TRUE                   TRUE 
##            hh_children        cons_nondurable          asset_durable 
##                   TRUE                   TRUE                   TRUE 
##            asset_phone          asset_savings asset_land_owned_total 
##                   TRUE                   TRUE                   TRUE 
##           cons_allfood           cons_alcohol           cons_tobacco 
##                   TRUE                   TRUE                   TRUE 
##         cons_med_total      cons_med_children                cons_ed 
##                   TRUE                   TRUE                   TRUE 
##            cons_social             cons_other      ent_nonag_revenue 
##                   TRUE                   TRUE                   TRUE 
##     ent_nonag_flowcost         ent_total_cost      fs_adwholed_often 
##                   TRUE                   TRUE                   TRUE 
##      fs_chwholed_often                fs_meat           fs_enoughtom 
##                   TRUE                   TRUE                   TRUE 
##            fs_sleephun     med_sickdays_hhave          med_u5_deaths 
##                   TRUE                   TRUE                   TRUE 
##            ed_expenses        ed_schoolattend     durable_investment 
##                   TRUE                   TRUE                   TRUE 
##  nondurable_investment             depression 
##                   TRUE                   TRUE
##                                 sex          age marital_status      children
## sex                     1.000000000 -0.136329281     0.27351242  0.2235183786
## age                    -0.136329281  1.000000000    -0.40009835 -0.1122369397
## marital_status          0.273512416 -0.400098351     1.00000000  0.2213770708
## children                0.223518379 -0.112236940     0.22137707  1.0000000000
## household_size          0.255252384 -0.069977261     0.32478405  0.9348874419
## years_of_edu           -0.076007865 -0.396506438     0.19905840  0.1635600948
## hh_children             0.197450150 -0.071031539     0.15366930  0.6369515831
## cons_nondurable         0.075544133 -0.030918410     0.15027328  0.0885514198
## asset_durable           0.079719412 -0.090569900     0.12976656  0.0532511254
## asset_phone             0.146373413 -0.185179059     0.18265302  0.0818207541
## asset_savings           0.004742042 -0.013258537     0.03104469 -0.0046487712
## asset_land_owned_total -0.010935527  0.256278615    -0.06060419 -0.0486652177
## cons_allfood            0.077389856 -0.026222298     0.13473087  0.0693714365
## cons_alcohol           -0.036498211  0.016826151     0.06514488 -0.0033885692
## cons_tobacco           -0.017540174  0.062568688     0.05734993  0.0209312796
## cons_med_total         -0.104872487  0.016557203     0.08105429  0.0006077517
## cons_med_children      -0.125741746  0.024128426     0.05199862  0.0282779995
## cons_ed                 0.045205835  0.136085173     0.03259242  0.1394922162
## cons_social             0.068544257 -0.034876686     0.06504699  0.0420738659
## cons_other              0.084473371 -0.094560519     0.11707611  0.1037599523
## ent_nonag_revenue       0.037908468 -0.019200138    -0.05142287  0.0426702715
## ent_nonag_flowcost      0.046117780 -0.022769838    -0.05182265  0.0389417970
## ent_total_cost          0.051106928 -0.021587201    -0.04375056  0.0432178223
## fs_adwholed_often      -0.025735795  0.177070258    -0.11757884 -0.0158685586
## fs_chwholed_often       0.032672346  0.082337308    -0.01532264  0.0944021589
## fs_meat                -0.019492346 -0.048549047     0.05497803  0.0001027198
## fs_enoughtom            0.067151860 -0.059864081     0.06155151 -0.0162940716
## fs_sleephun             0.006099500  0.124709659    -0.08256889 -0.0049687286
## med_sickdays_hhave     -0.092736460  0.204572737    -0.16532619 -0.2298859774
## med_u5_deaths           0.042598585 -0.097040359     0.08495699 -0.0639881754
## ed_expenses             0.048682509  0.107937400     0.03391059  0.1693412422
## ed_schoolattend         0.158867156  0.010671546     0.04596508  0.2560814444
## durable_investment      0.109633619 -0.009946984     0.13200595  0.0725051879
## nondurable_investment   0.044284961 -0.015760895    -0.01258186  0.0385884155
## depression             -0.007864312  0.104586878    -0.08135822  0.0093130715
##                        household_size years_of_edu   hh_children
## sex                      0.2552523839 -0.076007865  1.974502e-01
## age                     -0.0699772611 -0.396506438 -7.103154e-02
## marital_status           0.3247840526  0.199058397  1.536693e-01
## children                 0.9348874419  0.163560095  6.369516e-01
## household_size           1.0000000000  0.133991220  6.180790e-01
## years_of_edu             0.1339912201  1.000000000  8.929514e-02
## hh_children              0.6180789557  0.089295136  1.000000e+00
## cons_nondurable          0.1437222124  0.069834490  4.630234e-01
## asset_durable            0.0935210875  0.162811265  3.602833e-01
## asset_phone              0.1170388024  0.165127878  3.667773e-01
## asset_savings            0.0043669668  0.073918494  4.915296e-02
## asset_land_owned_total   0.0053866003 -0.111563942  1.879720e-01
## cons_allfood             0.1203352412  0.039645363  4.123813e-01
## cons_alcohol             0.0122955425 -0.010042348  7.091114e-02
## cons_tobacco             0.0400518195 -0.034187781  1.160074e-01
## cons_med_total           0.0252185584  0.099440142  9.257147e-02
## cons_med_children        0.0520810853  0.097327839  2.017262e-02
## cons_ed                  0.1893597383 -0.004611074  2.671770e-01
## cons_social              0.0658872107  0.077059530  2.393156e-01
## cons_other               0.1337700099  0.128357993  4.131048e-01
## ent_nonag_revenue        0.0239992231  0.037650959  9.529252e-02
## ent_nonag_flowcost       0.0293371793  0.065584958  1.073105e-01
## ent_total_cost           0.0360217142  0.068727241  1.274923e-01
## fs_adwholed_often       -0.0292993730 -0.074643503  1.315172e-01
## fs_chwholed_often        0.0803987262 -0.021563508  6.247656e-02
## fs_meat                  0.0056533046  0.005143384  9.411094e-05
## fs_enoughtom             0.0006370245  0.047467346  1.887765e-01
## fs_sleephun             -0.0037211053 -0.098001282  2.429133e-01
## med_sickdays_hhave      -0.2533737057 -0.128186650 -1.498050e-01
## med_u5_deaths           -0.0412230032 -0.012647319  2.988938e-02
## ed_expenses              0.2063735270  0.016080849  3.005115e-01
## ed_schoolattend          0.2794054024  0.041076636  6.675593e-01
## durable_investment       0.1199287201  0.093549064  3.781532e-01
## nondurable_investment    0.0416125923  0.095695750  1.422898e-01
## depression               0.0020570582 -0.126243493  4.738743e-03
##                        cons_nondurable asset_durable asset_phone asset_savings
## sex                        0.075544133    0.07971941  0.14637341   0.004742042
## age                       -0.030918410   -0.09056990 -0.18517906  -0.013258537
## marital_status             0.150273280    0.12976656  0.18265302   0.031044685
## children                   0.088551420    0.05325113  0.08182075  -0.004648771
## household_size             0.143722212    0.09352109  0.11703880   0.004366967
## years_of_edu               0.069834490    0.16281127  0.16512788   0.073918494
## hh_children                0.463023399    0.36028329  0.36677729   0.049152959
## cons_nondurable            1.000000000    0.44260580  0.39682862   0.117579359
## asset_durable              0.442605797    1.00000000  0.49932481   0.268743316
## asset_phone                0.396828619    0.49932481  1.00000000   0.124437451
## asset_savings              0.117579359    0.26874332  0.12443745   1.000000000
## asset_land_owned_total     0.247368822    0.16229855  0.06397891   0.016327399
## cons_allfood               0.966451134    0.38704219  0.33332285   0.087968572
## cons_alcohol               0.244126180    0.03295366 -0.02207446  -0.005041868
## cons_tobacco               0.226869151    0.03559299 -0.01579521   0.010629950
## cons_med_total             0.314684586    0.11938720  0.07360623   0.003439300
## cons_med_children          0.191441649    0.04138057  0.02349999   0.001116218
## cons_ed                    0.275967835    0.15840657  0.27211681   0.155487327
## cons_social                0.367013887    0.30281759  0.23643087   0.115968420
## cons_other                 0.646331813    0.43325129  0.44569583   0.147054846
## ent_nonag_revenue          0.095780463    0.09550633  0.08172755   0.016340257
## ent_nonag_flowcost         0.103552316    0.13975557  0.18084418   0.046014193
## ent_total_cost             0.140754042    0.16546590  0.20188891   0.057165530
## fs_adwholed_often          0.046922114   -0.01941714 -0.03984742  -0.015823192
## fs_chwholed_often         -0.030576510   -0.07280189 -0.07976236  -0.019647063
## fs_meat                    0.112228106    0.06448843 -0.03484732  -0.018140530
## fs_enoughtom               0.255075344    0.23659578  0.24016406   0.041759047
## fs_sleephun                0.162589103    0.02511811  0.04113945  -0.010747829
## med_sickdays_hhave        -0.008993309   -0.07843424 -0.04943072  -0.033227273
## med_u5_deaths              0.101577790    0.05067042  0.05474938  -0.010191622
## ed_expenses                0.291615160    0.15905972  0.29116466   0.167918539
## ed_schoolattend            0.465165658    0.37999575  0.34876475   0.036464076
## durable_investment         0.492244105    0.77609034  0.40644038   0.257011353
## nondurable_investment      0.193600512    0.29497134  0.24368018   0.648200858
## depression                -0.023045708   -0.04797939 -0.03893431  -0.011961091
##                        asset_land_owned_total cons_allfood cons_alcohol
## sex                               -0.01093553   0.07738986 -0.036498211
## age                                0.25627861  -0.02622230  0.016826151
## marital_status                    -0.06060419   0.13473087  0.065144877
## children                          -0.04866522   0.06937144 -0.003388569
## household_size                     0.00538660   0.12033524  0.012295542
## years_of_edu                      -0.11156394   0.03964536 -0.010042348
## hh_children                        0.18797196   0.41238130  0.070911138
## cons_nondurable                    0.24736882   0.96645113  0.244126180
## asset_durable                      0.16229855   0.38704219  0.032953660
## asset_phone                        0.06397891   0.33332285 -0.022074460
## asset_savings                      0.01632740   0.08796857 -0.005041868
## asset_land_owned_total             1.00000000   0.23271245  0.063913919
## cons_allfood                       0.23271245   1.00000000  0.185506672
## cons_alcohol                       0.06391392   0.18550667  1.000000000
## cons_tobacco                       0.09402880   0.18295743  0.502605353
## cons_med_total                     0.06147515   0.18584180  0.134946387
## cons_med_children                  0.02732393   0.07926619  0.068029884
## cons_ed                            0.12545505   0.20054551  0.024306976
## cons_social                        0.10834378   0.26641775  0.036464179
## cons_other                         0.16725163   0.46836751  0.101625499
## ent_nonag_revenue                  0.01324190   0.05225984 -0.012543798
## ent_nonag_flowcost                 0.01585542   0.07865089 -0.017115253
## ent_total_cost                     0.03064688   0.11290094 -0.012449977
## fs_adwholed_often                  0.14552518   0.03672290  0.013146876
## fs_chwholed_often                  0.01582205  -0.03405814 -0.021196950
## fs_meat                            0.03132778   0.09854554  0.126341951
## fs_enoughtom                       0.13028491   0.22580756  0.034942482
## fs_sleephun                        0.18034645   0.15122503  0.037142748
## med_sickdays_hhave                 0.04628771  -0.01849022  0.028068731
## med_u5_deaths                      0.01897216   0.06631980  0.198556064
## ed_expenses                        0.10082290   0.21840627  0.023067379
## ed_schoolattend                    0.20008855   0.42632177  0.052493236
## durable_investment                 0.25576878   0.45779919  0.039463321
## nondurable_investment              0.04069567   0.15021993 -0.010906572
## depression                         0.01096124  -0.02350565  0.042121172
##                         cons_tobacco cons_med_total cons_med_children
## sex                    -0.0175401741  -0.1048724867     -0.1257417460
## age                     0.0625686885   0.0165572033      0.0241284261
## marital_status          0.0573499291   0.0810542890      0.0519986162
## children                0.0209312796   0.0006077517      0.0282779995
## household_size          0.0400518195   0.0252185584      0.0520810853
## years_of_edu           -0.0341877809   0.0994401422      0.0973278393
## hh_children             0.1160073818   0.0925714682      0.0201726238
## cons_nondurable         0.2268691507   0.3146845860      0.1914416493
## asset_durable           0.0355929853   0.1193871972      0.0413805697
## asset_phone            -0.0157952145   0.0736062280      0.0234999920
## asset_savings           0.0106299497   0.0034392999      0.0011162182
## asset_land_owned_total  0.0940288005   0.0614751461      0.0273239283
## cons_allfood            0.1829574311   0.1858418009      0.0792661942
## cons_alcohol            0.5026053525   0.1349463873      0.0680298840
## cons_tobacco            1.0000000000   0.0719252442      0.0126404776
## cons_med_total          0.0719252442   1.0000000000      0.7792911115
## cons_med_children       0.0126404776   0.7792911115      1.0000000000
## cons_ed                 0.0941021639   0.0569848480      0.0584086055
## cons_social             0.0309225723   0.1307350466      0.0667593708
## cons_other              0.1034908050   0.2855831866      0.2391992748
## ent_nonag_revenue      -0.0066277729   0.0038541657     -0.0012223591
## ent_nonag_flowcost     -0.0054646436   0.0252404456      0.0029284686
## ent_total_cost         -0.0007553423   0.0349950410      0.0080527164
## fs_adwholed_often       0.0524480332  -0.0006815192     -0.0309301807
## fs_chwholed_often       0.0200372006  -0.0255437548     -0.0251468664
## fs_meat                 0.1005412728   0.1093533902      0.0498740301
## fs_enoughtom            0.0464868087   0.0257174322      0.0143536900
## fs_sleephun             0.1123748169   0.0510615575     -0.0113235128
## med_sickdays_hhave      0.0279649144   0.1386186377      0.0434055344
## med_u5_deaths           0.1177856759   0.0571678294      0.0002486552
## ed_expenses             0.0913978574   0.0637326255      0.0660529705
## ed_schoolattend         0.0882712527   0.1154333560      0.0603528471
## durable_investment      0.0423857944   0.0808779207      0.0203987673
## nondurable_investment   0.0116056837   0.0320561146      0.0103851547
## depression              0.0075174326   0.0221150547      0.0127304276
##                             cons_ed cons_social  cons_other ent_nonag_revenue
## sex                     0.045205835  0.06854426  0.08447337       0.037908468
## age                     0.136085173 -0.03487669 -0.09456052      -0.019200138
## marital_status          0.032592424  0.06504699  0.11707611      -0.051422866
## children                0.139492216  0.04207387  0.10375995       0.042670272
## household_size          0.189359738  0.06588721  0.13377001       0.023999223
## years_of_edu           -0.004611074  0.07705953  0.12835799       0.037650959
## hh_children             0.267177027  0.23931561  0.41310477       0.095292518
## cons_nondurable         0.275967835  0.36701389  0.64633181       0.095780463
## asset_durable           0.158406566  0.30281759  0.43325129       0.095506333
## asset_phone             0.272116811  0.23643087  0.44569583       0.081727551
## asset_savings           0.155487327  0.11596842  0.14705485       0.016340257
## asset_land_owned_total  0.125455052  0.10834378  0.16725163       0.013241904
## cons_allfood            0.200545514  0.26641775  0.46836751       0.052259845
## cons_alcohol            0.024306976  0.03646418  0.10162550      -0.012543798
## cons_tobacco            0.094102164  0.03092257  0.10349081      -0.006627773
## cons_med_total          0.056984848  0.13073505  0.28558319       0.003854166
## cons_med_children       0.058408605  0.06675937  0.23919927      -0.001222359
## cons_ed                 1.000000000  0.19281115  0.16976686       0.026746125
## cons_social             0.192811146  1.00000000  0.34558889       0.100996801
## cons_other              0.169766865  0.34558889  1.00000000       0.228142993
## ent_nonag_revenue       0.026746125  0.10099680  0.22814299       1.000000000
## ent_nonag_flowcost      0.057504433  0.12752960  0.13639899       0.564953185
## ent_total_cost          0.072102693  0.15161507  0.16542771       0.565750337
## fs_adwholed_often       0.082810120  0.01929808  0.04653217       0.170287252
## fs_chwholed_often       0.048097379  0.01403319 -0.01714582      -0.030269601
## fs_meat                -0.041333167  0.02076895  0.07691808       0.028099526
## fs_enoughtom            0.037661735  0.16629961  0.26905531       0.064079249
## fs_sleephun             0.096060425  0.08164118  0.10218572      -0.021059473
## med_sickdays_hhave     -0.031241762 -0.02306961 -0.02347482      -0.039512111
## med_u5_deaths          -0.011135655  0.17452733  0.10810966       0.003023409
## ed_expenses             0.932086917  0.20766443  0.18847510       0.030952045
## ed_schoolattend         0.261744320  0.23473849  0.36779176       0.084188394
## durable_investment      0.211327294  0.34838415  0.38004750       0.094004667
## nondurable_investment   0.209248308  0.19568712  0.22320715       0.439267596
## depression              0.005230865 -0.01787429 -0.03537107      -0.013235113
##                        ent_nonag_flowcost ent_total_cost fs_adwholed_often
## sex                           0.046117780   0.0511069280     -0.0257357950
## age                          -0.022769838  -0.0215872006      0.1770702578
## marital_status               -0.051822652  -0.0437505578     -0.1175788363
## children                      0.038941797   0.0432178223     -0.0158685586
## household_size                0.029337179   0.0360217142     -0.0292993730
## years_of_edu                  0.065584958   0.0687272406     -0.0746435034
## hh_children                   0.107310547   0.1274922800      0.1315172316
## cons_nondurable               0.103552316   0.1407540419      0.0469221139
## asset_durable                 0.139755572   0.1654659009     -0.0194171359
## asset_phone                   0.180844181   0.2018889144     -0.0398474210
## asset_savings                 0.046014193   0.0571655304     -0.0158231919
## asset_land_owned_total        0.015855423   0.0306468790      0.1455251756
## cons_allfood                  0.078650890   0.1129009371      0.0367229003
## cons_alcohol                 -0.017115253  -0.0124499772      0.0131468760
## cons_tobacco                 -0.005464644  -0.0007553423      0.0524480332
## cons_med_total                0.025240446   0.0349950410     -0.0006815192
## cons_med_children             0.002928469   0.0080527164     -0.0309301807
## cons_ed                       0.057504433   0.0721026929      0.0828101197
## cons_social                   0.127529600   0.1516150670      0.0192980819
## cons_other                    0.136398994   0.1654277127      0.0465321700
## ent_nonag_revenue             0.564953185   0.5657503370      0.1702872519
## ent_nonag_flowcost            1.000000000   0.9969816926      0.0097189489
## ent_total_cost                0.996981693   1.0000000000      0.0105675360
## fs_adwholed_often             0.009718949   0.0105675360      1.0000000000
## fs_chwholed_often            -0.037625415  -0.0382841831      0.5425710592
## fs_meat                       0.079025266   0.0806493905     -0.1340449846
## fs_enoughtom                  0.074181063   0.0869145220     -0.0506673422
## fs_sleephun                  -0.012308726  -0.0049587045      0.3822554618
## med_sickdays_hhave           -0.017630569  -0.0153623822      0.1360304798
## med_u5_deaths                 0.004037221   0.0059919905      0.0091447394
## ed_expenses                   0.061693592   0.0763514717      0.0906725199
## ed_schoolattend               0.120118431   0.1403973120      0.1428906028
## durable_investment            0.135622413   0.1746417045     -0.0140809629
## nondurable_investment         0.784985236   0.7948137557      0.0037197249
## depression                   -0.035796973  -0.0378227725      0.1450221967
##                        fs_chwholed_often       fs_meat  fs_enoughtom
## sex                          0.032672346 -1.949235e-02  0.0671518599
## age                          0.082337308 -4.854905e-02 -0.0598640806
## marital_status              -0.015322635  5.497803e-02  0.0615515104
## children                     0.094402159  1.027198e-04 -0.0162940716
## household_size               0.080398726  5.653305e-03  0.0006370245
## years_of_edu                -0.021563508  5.143384e-03  0.0474673461
## hh_children                  0.062476564  9.411094e-05  0.1887764518
## cons_nondurable             -0.030576510  1.122281e-01  0.2550753442
## asset_durable               -0.072801888  6.448843e-02  0.2365957828
## asset_phone                 -0.079762362 -3.484732e-02  0.2401640635
## asset_savings               -0.019647063 -1.814053e-02  0.0417590469
## asset_land_owned_total       0.015822048  3.132778e-02  0.1302849086
## cons_allfood                -0.034058140  9.854554e-02  0.2258075564
## cons_alcohol                -0.021196950  1.263420e-01  0.0349424823
## cons_tobacco                 0.020037201  1.005413e-01  0.0464868087
## cons_med_total              -0.025543755  1.093534e-01  0.0257174322
## cons_med_children           -0.025146866  4.987403e-02  0.0143536900
## cons_ed                      0.048097379 -4.133317e-02  0.0376617349
## cons_social                  0.014033191  2.076895e-02  0.1662996123
## cons_other                  -0.017145817  7.691808e-02  0.2690553137
## ent_nonag_revenue           -0.030269601  2.809953e-02  0.0640792492
## ent_nonag_flowcost          -0.037625415  7.902527e-02  0.0741810632
## ent_total_cost              -0.038284183  8.064939e-02  0.0869145220
## fs_adwholed_often            0.542571059 -1.340450e-01 -0.0506673422
## fs_chwholed_often            1.000000000 -7.399030e-02 -0.0831790470
## fs_meat                     -0.073990305  1.000000e+00  0.0794455769
## fs_enoughtom                -0.083179047  7.944558e-02  1.0000000000
## fs_sleephun                  0.184778164 -1.216907e-01 -0.1208344766
## med_sickdays_hhave          -0.004417657  3.212457e-03 -0.0755126046
## med_u5_deaths                0.006057373  4.229920e-03  0.0164768184
## ed_expenses                  0.053368945 -4.436123e-02  0.0143314940
## ed_schoolattend              0.102065060 -6.065066e-03  0.1622906795
## durable_investment          -0.064374306  4.887798e-02  0.2552065531
## nondurable_investment       -0.037660830  4.762709e-02  0.0929336964
## depression                   0.066157458  1.771539e-02 -0.0208881386
##                         fs_sleephun med_sickdays_hhave med_u5_deaths
## sex                     0.006099500      -0.0927364596  0.0425985853
## age                     0.124709659       0.2045727368 -0.0970403586
## marital_status         -0.082568890      -0.1653261926  0.0849569889
## children               -0.004968729      -0.2298859774 -0.0639881754
## household_size         -0.003721105      -0.2533737057 -0.0412230032
## years_of_edu           -0.098001282      -0.1281866501 -0.0126473194
## hh_children             0.242913301      -0.1498050212  0.0298893786
## cons_nondurable         0.162589103      -0.0089933088  0.1015777896
## asset_durable           0.025118106      -0.0784342418  0.0506704246
## asset_phone             0.041139450      -0.0494307200  0.0547493819
## asset_savings          -0.010747829      -0.0332272730 -0.0101916217
## asset_land_owned_total  0.180346449       0.0462877072  0.0189721645
## cons_allfood            0.151225034      -0.0184902188  0.0663198015
## cons_alcohol            0.037142748       0.0280687310  0.1985560641
## cons_tobacco            0.112374817       0.0279649144  0.1177856759
## cons_med_total          0.051061557       0.1386186377  0.0571678294
## cons_med_children      -0.011323513       0.0434055344  0.0002486552
## cons_ed                 0.096060425      -0.0312417616 -0.0111356555
## cons_social             0.081641183      -0.0230696113  0.1745273254
## cons_other              0.102185717      -0.0234748189  0.1081096648
## ent_nonag_revenue      -0.021059473      -0.0395121110  0.0030234086
## ent_nonag_flowcost     -0.012308726      -0.0176305685  0.0040372210
## ent_total_cost         -0.004958705      -0.0153623822  0.0059919905
## fs_adwholed_often       0.382255462       0.1360304798  0.0091447394
## fs_chwholed_often       0.184778164      -0.0044176570  0.0060573727
## fs_meat                -0.121690697       0.0032124566  0.0042299197
## fs_enoughtom           -0.120834477      -0.0755126046  0.0164768184
## fs_sleephun             1.000000000       0.1109367855  0.0569282281
## med_sickdays_hhave      0.110936786       1.0000000000 -0.0005615704
## med_u5_deaths           0.056928228      -0.0005615704  1.0000000000
## ed_expenses             0.109207431      -0.0394202498 -0.0032002959
## ed_schoolattend         0.267136054      -0.1183477689  0.0799009697
## durable_investment      0.031104957      -0.0800212211  0.0268125273
## nondurable_investment  -0.004177845      -0.0333406914 -0.0022274803
## depression              0.017246050       0.0604360210 -0.0143394154
##                         ed_expenses ed_schoolattend durable_investment
## sex                     0.048682509     0.158867156        0.109633619
## age                     0.107937400     0.010671546       -0.009946984
## marital_status          0.033910593     0.045965077        0.132005951
## children                0.169341242     0.256081444        0.072505188
## household_size          0.206373527     0.279405402        0.119928720
## years_of_edu            0.016080849     0.041076636        0.093549064
## hh_children             0.300511514     0.667559294        0.378153225
## cons_nondurable         0.291615160     0.465165658        0.492244105
## asset_durable           0.159059724     0.379995753        0.776090343
## asset_phone             0.291164663     0.348764754        0.406440378
## asset_savings           0.167918539     0.036464076        0.257011353
## asset_land_owned_total  0.100822898     0.200088550        0.255768780
## cons_allfood            0.218406274     0.426321774        0.457799192
## cons_alcohol            0.023067379     0.052493236        0.039463321
## cons_tobacco            0.091397857     0.088271253        0.042385794
## cons_med_total          0.063732626     0.115433356        0.080877921
## cons_med_children       0.066052971     0.060352847        0.020398767
## cons_ed                 0.932086917     0.261744320        0.211327294
## cons_social             0.207664428     0.234738494        0.348384149
## cons_other              0.188475100     0.367791765        0.380047503
## ent_nonag_revenue       0.030952045     0.084188394        0.094004667
## ent_nonag_flowcost      0.061693592     0.120118431        0.135622413
## ent_total_cost          0.076351472     0.140397312        0.174641704
## fs_adwholed_often       0.090672520     0.142890603       -0.014080963
## fs_chwholed_often       0.053368945     0.102065060       -0.064374306
## fs_meat                -0.044361226    -0.006065066        0.048877976
## fs_enoughtom            0.014331494     0.162290679        0.255206553
## fs_sleephun             0.109207431     0.267136054        0.031104957
## med_sickdays_hhave     -0.039420250    -0.118347769       -0.080021221
## med_u5_deaths          -0.003200296     0.079900970        0.026812527
## ed_expenses             1.000000000     0.305664668        0.206323141
## ed_schoolattend         0.305664668     1.000000000        0.386852990
## durable_investment      0.206323141     0.386852990        1.000000000
## nondurable_investment   0.215640856     0.144158551        0.298216237
## depression              0.009521442    -0.004883243       -0.017430584
##                        nondurable_investment   depression
## sex                              0.044284961 -0.007864312
## age                             -0.015760895  0.104586878
## marital_status                  -0.012581860 -0.081358216
## children                         0.038588415  0.009313072
## household_size                   0.041612592  0.002057058
## years_of_edu                     0.095695750 -0.126243493
## hh_children                      0.142289845  0.004738743
## cons_nondurable                  0.193600512 -0.023045708
## asset_durable                    0.294971338 -0.047979393
## asset_phone                      0.243680181 -0.038934311
## asset_savings                    0.648200858 -0.011961091
## asset_land_owned_total           0.040695673  0.010961241
## cons_allfood                     0.150219929 -0.023505646
## cons_alcohol                    -0.010906572  0.042121172
## cons_tobacco                     0.011605684  0.007517433
## cons_med_total                   0.032056115  0.022115055
## cons_med_children                0.010385155  0.012730428
## cons_ed                          0.209248308  0.005230865
## cons_social                      0.195687124 -0.017874286
## cons_other                       0.223207146 -0.035371066
## ent_nonag_revenue                0.439267596 -0.013235113
## ent_nonag_flowcost               0.784985236 -0.035796973
## ent_total_cost                   0.794813756 -0.037822772
## fs_adwholed_often                0.003719725  0.145022197
## fs_chwholed_often               -0.037660830  0.066157458
## fs_meat                          0.047627091  0.017715394
## fs_enoughtom                     0.092933696 -0.020888139
## fs_sleephun                     -0.004177845  0.017246050
## med_sickdays_hhave              -0.033340691  0.060436021
## med_u5_deaths                   -0.002227480 -0.014339415
## ed_expenses                      0.215640856  0.009521442
## ed_schoolattend                  0.144158551 -0.004883243
## durable_investment               0.298216237 -0.017430584
## nondurable_investment            1.000000000 -0.035400756
## depression                      -0.035400756  1.000000000

3.2 Explore the data

3.2.1 Box plot and Bar plot.

4.3 Up Sampling

4.4 Down Sampling

5.Model applications and Model Comparison

5.1 Imbalanced dataset

5.1.1 Logistic Model

## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 204  36
##          1   2   1
##                                           
##                Accuracy : 0.8436          
##                  95% CI : (0.7917, 0.8869)
##     No Information Rate : 0.8477          
##     P-Value [Acc > NIR] : 0.6129          
##                                           
##                   Kappa : 0.0278          
##                                           
##  Mcnemar's Test P-Value : 8.636e-08       
##                                           
##             Sensitivity : 0.99029         
##             Specificity : 0.02703         
##          Pos Pred Value : 0.85000         
##          Neg Pred Value : 0.33333         
##              Prevalence : 0.84774         
##          Detection Rate : 0.83951         
##    Detection Prevalence : 0.98765         
##       Balanced Accuracy : 0.50866         
##                                           
##        'Positive' Class : 0               
## 

5.1.1 Summary of Logistic model on imbalanced dataset

This logistic regression model reaches an accuracy of 84.36%, but it barely distinguishes between the two classes. The accuracy is slightly below the No Information Rate (NIR, 84.77%), so the model does no better than always predicting the majority class (0, not depressed). Note that caret treats class 0 as the ‘positive’ class in this output: the very high sensitivity refers to class 0, while the very low specificity shows that almost no depressed respondents (class 1) are detected.

Accuracy: 84.36%, which mainly reflects the class imbalance (about 85% of the test observations belong to class 0).

Positive Predictive Value and Negative Predictive Value: PPV is 85%, meaning that 85% of the observations predicted as class 0 (the positive class here) really are class 0. NPV is only 33.33%: the model predicted class 1 just three times and was right once. The model cannot reliably identify class ‘1’.

Sensitivity and Specificity: Sensitivity is as high as 99.03%, but specificity is only 2.70%. Because the positive class is 0, this means the model identifies almost all non-depressed respondents but only 1 of the 37 depressed respondents (class 1).

Kappa statistic: Kappa is only 0.0278, so the agreement between predictions and true labels is barely better than chance.

Mcnemar’s Test P-value: 8.636e-08, indicating that the two types of error are far from symmetric: 36 depressed respondents were predicted as class 0, while only 2 non-depressed respondents were predicted as class 1.

Balanced Accuracy: 50.87%, essentially the 50% expected from guessing, which confirms that the model is inadequate for this imbalanced dataset.
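
The figures quoted in this summary follow directly from the four cells of the confusion matrix. The sketch below recomputes them in Python. The report itself was produced in R with caret, so the function name `class_metrics` and the dictionary keys are illustrative assumptions, not part of the original analysis. Class 0 is treated as the positive class, as in the caret output.

```python
def class_metrics(tp, fn, fp, tn):
    """Recompute caret-style metrics from confusion-matrix counts.

    'Positive' means class 0 (not depressed), matching the caret output:
      tp = predicted 0, actual 0    fn = predicted 1, actual 0
      fp = predicted 0, actual 1    tn = predicted 1, actual 1
    """
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "no_information_rate": max(tp + fn, fp + tn) / total,
    }

# Counts from the 5.1.1 confusion matrix.
logit = class_metrics(tp=204, fn=2, fp=36, tn=1)
for name, value in logit.items():
    print(f"{name:>20}: {value:.4f}")
```

Swapping in the counts from another confusion matrix in this report gives the corresponding figures for that model.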

5.1.2 Weighted Logistic model

## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 108  12
##          1  29  13
##                                           
##                Accuracy : 0.7469          
##                  95% CI : (0.6727, 0.8119)
##     No Information Rate : 0.8457          
##     P-Value [Acc > NIR] : 0.99961         
##                                           
##                   Kappa : 0.2413          
##                                           
##  Mcnemar's Test P-Value : 0.01246         
##                                           
##             Sensitivity : 0.7883          
##             Specificity : 0.5200          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.3095          
##              Prevalence : 0.8457          
##          Detection Rate : 0.6667          
##    Detection Prevalence : 0.7407          
##       Balanced Accuracy : 0.6542          
##                                           
##        'Positive' Class : 0               
##                                           
## [1] "AIC: 1238.59469568419"

5.1.2 Summary of Weighted Logistic model on imbalanced dataset

Accuracy: 74.69%, lower than the unweighted model (84.36%) and below the No Information Rate of 84.57%. This model was evaluated on 162 test observations rather than 243, so the comparison is only approximate.

Predictive Ability: When the model predicts class 0 it is right 90% of the time (PPV), but when it predicts class 1 it is right only 30.95% of the time (NPV). Predictions of the depressed class therefore remain unreliable, as in the previous model, because of the class imbalance.

Balanced Accuracy: 0.6542, up from 0.5087. Specificity rises from 2.70% to 52.00%, so the model now detects about half of the depressed respondents (class 1, which caret treats as the negative class), at the cost of a lower sensitivity (78.83%).

Mcnemar’s Test P-Value: 0.01246, indicating that the two types of error are still unbalanced, now in the opposite direction: 29 non-depressed respondents were predicted as class 1, against 12 depressed respondents predicted as class 0.

Kappa: 0.2413, much higher than the 0.0278 of the previous model, because the weights give class ‘1’ more influence during fitting.
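
McNemar's test looks only at the two off-diagonal cells of the confusion matrix, so it measures whether one kind of error is more common than the other. The Python sketch below assumes the continuity-corrected form of the test, which is the default of R's `mcnemar.test`. The helper name `mcnemar_p` is hypothetical.

```python
import math

def mcnemar_p(b, c):
    """Continuity-corrected McNemar p-value for off-diagonal counts b and c."""
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))

# b = depressed cases predicted as 0, c = non-depressed cases predicted as 1.
p_unweighted = mcnemar_p(36, 2)   # 5.1.1 logistic model
p_weighted = mcnemar_p(12, 29)    # 5.1.2 weighted logistic model
print(f"unweighted: {p_unweighted:.3e}, weighted: {p_weighted:.5f}")
```

Under that assumption the sketch should reproduce the two p-values reported in 5.1.1 and 5.1.2. A small p-value says only that the errors are lopsided, not which model is better.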

5.1.3 Decision Tree model

## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"

## Call:
## rpart(formula = depressed ~ ., data = train_data, method = "class")
##   n= 570 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.03030303      0 1.0000000 1.000000 0.09802678
## 2 0.02272727      3 0.9090909 1.079545 0.10110870
## 3 0.01136364      5 0.8636364 1.181818 0.10478269
## 4 0.01000000      6 0.8522727 1.261364 0.10743548
## 
## Variable importance
##           hh_children          years_of_edu    med_sickdays_hhave 
##                    14                    12                    11 
##           cons_social     fs_adwholed_often            cons_other 
##                    10                    10                     8 
##       cons_nondurable    durable_investment                   age 
##                     5                     5                     4 
##          cons_allfood nondurable_investment         asset_durable 
##                     4                     3                     3 
##     fs_chwholed_often        household_size         asset_savings 
##                     3                     2                     1 
##           fs_sleephun               fs_meat 
##                     1                     1 
## 
## Node number 1: 570 observations,    complexity param=0.03030303
##   predicted class=0  expected loss=0.154386  P(node) =1
##     class counts:   482    88
##    probabilities: 0.846 0.154 
##   left son=2 (477 obs) right son=3 (93 obs)
##   Primary splits:
##       years_of_edu      < 6.5        to the right, improve=4.107167, (0 missing)
##       fs_adwholed_often < 2          to the left,  improve=3.973162, (0 missing)
##       children          < 0.5        to the right, improve=2.285400, (0 missing)
##       age               < 66.5       to the left,  improve=2.021110, (0 missing)
##       asset_phone       < 15.93529   to the right, improve=1.665369, (0 missing)
##   Surrogate splits:
##       age               < 56.5       to the left,  agree=0.870, adj=0.204, (0 split)
##       household_size    < 1.5        to the right, agree=0.840, adj=0.022, (0 split)
##       cons_social       < 24.08979   to the left,  agree=0.840, adj=0.022, (0 split)
##       children          < 0.5        to the right, agree=0.839, adj=0.011, (0 split)
##       cons_med_children < 12.8123    to the left,  agree=0.839, adj=0.011, (0 split)
## 
## Node number 2: 477 observations,    complexity param=0.02272727
##   predicted class=0  expected loss=0.1278826  P(node) =0.8368421
##     class counts:   416    61
##    probabilities: 0.872 0.128 
##   left son=4 (253 obs) right son=5 (224 obs)
##   Primary splits:
##       cons_social        < 0.7273647  to the right, improve=1.8047670, (0 missing)
##       durable_investment < 234.8594   to the right, improve=1.6072560, (0 missing)
##       asset_phone        < 63.26071   to the left,  improve=1.3022130, (0 missing)
##       ed_expenses        < 45.6438    to the left,  improve=0.8842878, (0 missing)
##       fs_adwholed_often  < 2          to the left,  improve=0.8533597, (0 missing)
##   Surrogate splits:
##       cons_nondurable       < 46.84868   to the right, agree=0.878, adj=0.741, (0 split)
##       cons_other            < 3.483343   to the right, agree=0.878, adj=0.741, (0 split)
##       durable_investment    < 36.43497   to the right, agree=0.878, adj=0.741, (0 split)
##       nondurable_investment < 0.02780446 to the right, agree=0.878, adj=0.741, (0 split)
##       asset_durable         < 11.29084   to the right, agree=0.876, adj=0.737, (0 split)
## 
## Node number 3: 93 observations,    complexity param=0.03030303
##   predicted class=0  expected loss=0.2903226  P(node) =0.1631579
##     class counts:    66    27
##    probabilities: 0.710 0.290 
##   left son=6 (69 obs) right son=7 (24 obs)
##   Primary splits:
##       fs_adwholed_often     < 2          to the left,  improve=4.087073, (0 missing)
##       cons_social           < 1.621556   to the left,  improve=3.937965, (0 missing)
##       nondurable_investment < 4.569942   to the left,  improve=2.632348, (0 missing)
##       cons_alcohol          < 0.587257   to the right, improve=2.547581, (0 missing)
##       ent_total_cost        < 4.771246   to the left,  improve=2.439231, (0 missing)
##   Surrogate splits:
##       fs_chwholed_often < 2          to the left,  agree=0.785, adj=0.167, (0 split)
##       fs_sleephun       < 0.5        to the left,  agree=0.774, adj=0.125, (0 split)
##       fs_meat           < 0.5        to the right, agree=0.763, adj=0.083, (0 split)
##       cons_med_total    < 1.040999   to the left,  agree=0.753, adj=0.042, (0 split)
##       cons_ed           < 10.2098    to the left,  agree=0.753, adj=0.042, (0 split)
## 
## Node number 4: 253 observations
##   predicted class=0  expected loss=0.08695652  P(node) =0.4438596
##     class counts:   231    22
##    probabilities: 0.913 0.087 
## 
## Node number 5: 224 observations,    complexity param=0.02272727
##   predicted class=0  expected loss=0.1741071  P(node) =0.3929825
##     class counts:   185    39
##    probabilities: 0.826 0.174 
##   left son=10 (214 obs) right son=11 (10 obs)
##   Primary splits:
##       hh_children       < 4.5        to the left,  improve=5.789736, (0 missing)
##       fs_adwholed_often < 2          to the left,  improve=4.149148, (0 missing)
##       cons_ed           < 1.287903   to the left,  improve=3.827600, (0 missing)
##       ed_expenses       < 15.45483   to the left,  improve=3.535107, (0 missing)
##       cons_med_children < 0.4003842  to the right, improve=2.183618, (0 missing)
##   Surrogate splits:
##       asset_savings      < 28.82767   to the left,  agree=0.96, adj=0.1, (0 split)
##       fs_chwholed_often  < 2          to the left,  agree=0.96, adj=0.1, (0 split)
##       durable_investment < 843.3353   to the left,  agree=0.96, adj=0.1, (0 split)
## 
## Node number 6: 69 observations,    complexity param=0.01136364
##   predicted class=0  expected loss=0.2028986  P(node) =0.1210526
##     class counts:    55    14
##    probabilities: 0.797 0.203 
##   left son=12 (62 obs) right son=13 (7 obs)
##   Primary splits:
##       cons_other      < 34.5932    to the left,  improve=2.1160760, (0 missing)
##       cons_alcohol    < 0.587257   to the right, improve=0.9629084, (0 missing)
##       children        < 1.5        to the left,  improve=0.8641367, (0 missing)
##       years_of_edu    < 5.5        to the right, improve=0.8101365, (0 missing)
##       cons_nondurable < 241.4361   to the left,  improve=0.7934950, (0 missing)
##   Surrogate splits:
##       cons_nondurable < 266.7672   to the left,  agree=0.928, adj=0.286, (0 split)
##       cons_social     < 9.142107   to the left,  agree=0.928, adj=0.286, (0 split)
##       cons_allfood    < 215.6241   to the left,  agree=0.913, adj=0.143, (0 split)
## 
## Node number 7: 24 observations,    complexity param=0.03030303
##   predicted class=1  expected loss=0.4583333  P(node) =0.04210526
##     class counts:    11    13
##    probabilities: 0.458 0.542 
##   left son=14 (14 obs) right son=15 (10 obs)
##   Primary splits:
##       med_sickdays_hhave < 1.525      to the left,  improve=4.402381, (0 missing)
##       cons_social        < 1.354633   to the left,  improve=2.938889, (0 missing)
##       ed_schoolattend    < 0.8571429  to the right, improve=2.937646, (0 missing)
##       years_of_edu       < 4.5        to the left,  improve=2.288095, (0 missing)
##       ent_total_cost     < 4.771246   to the left,  improve=1.399184, (0 missing)
##   Surrogate splits:
##       cons_social    < 3.203074   to the left,  agree=0.750, adj=0.4, (0 split)
##       cons_allfood   < 32.61111   to the right, agree=0.708, adj=0.3, (0 split)
##       age            < 66         to the left,  agree=0.667, adj=0.2, (0 split)
##       household_size < 1.5        to the right, agree=0.667, adj=0.2, (0 split)
##       years_of_edu   < 3.5        to the left,  agree=0.667, adj=0.2, (0 split)
## 
## Node number 10: 214 observations
##   predicted class=0  expected loss=0.1495327  P(node) =0.3754386
##     class counts:   182    32
##    probabilities: 0.850 0.150 
## 
## Node number 11: 10 observations
##   predicted class=1  expected loss=0.3  P(node) =0.01754386
##     class counts:     3     7
##    probabilities: 0.300 0.700 
## 
## Node number 12: 62 observations
##   predicted class=0  expected loss=0.1612903  P(node) =0.1087719
##     class counts:    52    10
##    probabilities: 0.839 0.161 
## 
## Node number 13: 7 observations
##   predicted class=1  expected loss=0.4285714  P(node) =0.0122807
##     class counts:     3     4
##    probabilities: 0.429 0.571 
## 
## Node number 14: 14 observations
##   predicted class=0  expected loss=0.2857143  P(node) =0.0245614
##     class counts:    10     4
##    probabilities: 0.714 0.286 
## 
## Node number 15: 10 observations
##   predicted class=1  expected loss=0.1  P(node) =0.01754386
##     class counts:     1     9
##    probabilities: 0.100 0.900 
## 
## n= 570 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 570 88 0 (0.84561404 0.15438596)  
##    2) years_of_edu>=6.5 477 61 0 (0.87211740 0.12788260)  
##      4) cons_social>=0.7273647 253 22 0 (0.91304348 0.08695652) *
##      5) cons_social< 0.7273647 224 39 0 (0.82589286 0.17410714)  
##       10) hh_children< 4.5 214 32 0 (0.85046729 0.14953271) *
##       11) hh_children>=4.5 10  3 1 (0.30000000 0.70000000) *
##    3) years_of_edu< 6.5 93 27 0 (0.70967742 0.29032258)  
##      6) fs_adwholed_often< 2 69 14 0 (0.79710145 0.20289855)  
##       12) cons_other< 34.5932 62 10 0 (0.83870968 0.16129032) *
##       13) cons_other>=34.5932 7  3 1 (0.42857143 0.57142857) *
##      7) fs_adwholed_often>=2 24 11 1 (0.45833333 0.54166667)  
##       14) med_sickdays_hhave< 1.525 14  4 0 (0.71428571 0.28571429) *
##       15) med_sickdays_hhave>=1.525 10  1 1 (0.10000000 0.90000000) *
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 199  36
##          1   7   1
##                                           
##                Accuracy : 0.823           
##                  95% CI : (0.7691, 0.8689)
##     No Information Rate : 0.8477          
##     P-Value [Acc > NIR] : 0.8759          
##                                           
##                   Kappa : -0.0102         
##                                           
##  Mcnemar's Test P-Value : 1.955e-05       
##                                           
##             Sensitivity : 0.96602         
##             Specificity : 0.02703         
##          Pos Pred Value : 0.84681         
##          Neg Pred Value : 0.12500         
##              Prevalence : 0.84774         
##          Detection Rate : 0.81893         
##    Detection Prevalence : 0.96708         
##       Balanced Accuracy : 0.49652         
##                                           
##        'Positive' Class : 0               
## 

5.1.3 Summary of Decision Tree model on imbalanced dataset

Based on the decision tree and its confusion matrix, the model predicts class 0 (non-depressed) almost exclusively and performs poorly on class 1 (depressed), much like the unweighted logistic model above.
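
The printed splits can be read as nested conditions. The sketch below transcribes the seven terminal nodes of the rpart output into a plain Python function. It is an illustration only: the original model is the fitted rpart object, and the function name `tree_predict` is an assumption. The node sizes are copied from the printed tree.

```python
def tree_predict(years_of_edu, cons_social, hh_children,
                 fs_adwholed_often, cons_other, med_sickdays_hhave):
    """Return (predicted class, terminal node) following the printed rpart splits."""
    if years_of_edu >= 6.5:
        if cons_social >= 0.7273647:
            return 0, 4
        if hh_children < 4.5:
            return 0, 10
        return 1, 11
    if fs_adwholed_often < 2:
        if cons_other < 34.5932:
            return 0, 12
        return 1, 13
    if med_sickdays_hhave < 1.525:
        return 0, 14
    return 1, 15

# Training observations per terminal node, as (n, class-1 count) from the output.
leaves = {4: (253, 22), 10: (214, 32), 11: (10, 7), 12: (62, 10),
          13: (7, 4), 14: (14, 4), 15: (10, 9)}
class1_leaves = [11, 13, 15]
n_in_class1_leaves = sum(leaves[node][0] for node in class1_leaves)
print(n_in_class1_leaves, "of", sum(n for n, _ in leaves.values()),
      "training rows reach a class 1 leaf")
```

Only three small leaves predict class 1, which is consistent with the tree labelling nearly every test observation as class 0.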

Accuracy: 82.3%, the model is highly accurate

Kappa statistic: Kappa is negative, indicating that the model has poor predictive power.

Sensitivity and specificity: the sensitivity was high (96.6%), indicating that the model was able to identify individuals with non-depressive symptoms well; however, the specificity was extremely low (2.7%), indicating that it was almost impossible to correctly identify individuals with true depressive symptoms.

Positive and negative predictive values: the positive predictive value is 84.68%, but the negative predictive value is only 12.5% (1 of the 8 predictions of class 1 is correct), confirming the model's poor ability to predict class 1.
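
The figures discussed in this summary can be recomputed by hand from the four cells of the confusion matrix. The following is a minimal base R sketch, not the code used to produce the report; the variable names are illustrative, and the cell counts are copied from the output above.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference.
# matrix() fills column by column, so the first column is Reference = 0.
cm <- matrix(c(199, 7, 36, 1), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n            # share of correct predictions
nir      <- max(colSums(cm)) / n         # accuracy of always predicting the larger class

# The 'positive' class is 0, so sensitivity refers to class 0 and specificity to class 1.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced_accuracy <- mean(c(sensitivity, specificity))

# Cohen's kappa: observed agreement corrected for the agreement expected by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy,
        kappa = cohen_kappa), 4)
```

Comparing `accuracy` with `nir` makes the main point of the summary explicit: an accuracy below the no-information rate, together with a kappa near zero, means the tree adds nothing over always predicting class 0.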

5.2 Balanced dataset

5.2.1 Logistic model on balanced dataset (Upsampling)
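
Upsampling balances the classes by drawing rows of the minority class with replacement until both classes have the same number of observations. The following is a minimal base R sketch of that idea on a hypothetical toy data frame; the helper `upsample_minority` and the toy data are assumptions for illustration, and the report does not state which function performed the resampling or at which stage it was applied.

```r
# Hypothetical toy data: 'depressed' is the binary outcome (1 = depressed).
set.seed(515)
toy <- data.frame(
  depressed = c(rep(0, 20), rep(1, 4)),
  age       = sample(18:70, 24, replace = TRUE)
)

# Illustrative helper (not from the report): append minority-class rows,
# sampled with replacement, until both classes have the same size.
upsample_minority <- function(df, outcome) {
  counts        <- table(df[[outcome]])
  minority      <- names(counts)[which.min(counts)]
  n_extra       <- max(counts) - min(counts)
  minority_rows <- which(df[[outcome]] == minority)
  # sample.int() on the index avoids sample()'s special case for length-1 input.
  extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
  df[c(seq_len(nrow(df)), extra_rows), , drop = FALSE]
}

balanced <- upsample_minority(toy, "depressed")
table(balanced$depressed)
```

Because the added rows are copies of existing observations, upsampling changes the class proportions but adds no new information about the minority class.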

## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 142  86
##          1  64 120
##                                           
##                Accuracy : 0.6359          
##                  95% CI : (0.5874, 0.6825)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.884e-08       
##                                           
##                   Kappa : 0.2718          
##                                           
##  Mcnemar's Test P-Value : 0.08641         
##                                           
##             Sensitivity : 0.6893          
##             Specificity : 0.5825          
##          Pos Pred Value : 0.6228          
##          Neg Pred Value : 0.6522          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3447          
##    Detection Prevalence : 0.5534          
##       Balanced Accuracy : 0.6359          
##                                           
##        'Positive' Class : 0               
## 

5.2.1 Summary of Logistic model on balanced dataset

Summarizing the model performance: overall performance is moderate, with an accuracy of 63.59%. Although this is lower than the 82.3% accuracy of the decision tree on the imbalanced data, the predictive ability is better, because the model now beats the no-information rate and detects both classes. Sensitivity and specificity are still modest.

No Information Rate (NIR): 0.5, which reflects that the two classes are equally represented in the test data.

P-Value [Acc > NIR]: 1.884e-08. This very small p-value indicates that the model's accuracy is significantly higher than the no-information rate, that is, higher than the accuracy obtained by always predicting a single class.

Kappa: 0.2718, indicating fair agreement beyond chance; the model has some, but limited, predictive power.

Mcnemar’s Test P-Value: 0.08641. This value is greater than 0.05, so there is no significant difference between the two types of misclassification (86 depressed cases predicted as class 0 versus 64 non-depressed cases predicted as class 1).

Sensitivity: 68.93%. Because the 'positive' class is 0, this means that the model correctly identifies 68.93% of the individuals who are actually non-depressed (142 of 206).

Specificity: 58.25%. The model correctly identifies 58.25% of the individuals who are actually depressed (120 of 206), a large improvement over the 2.7% achieved by the decision tree on the imbalanced data.

Pos Pred Value (PPV): 62.28%. This is the percentage of predictions of the positive class (0, non-depressed) that are correct: when the model predicts non-depressed, it is right 62.28% of the time.

Neg Pred Value (NPV): 65.22%. This is the percentage of predictions of the negative class (1, depressed) that are correct: when the model predicts depressed, it is right 65.22% of the time.

Balanced Accuracy: 63.59%. This is the average of the sensitivity and specificity. It equals the overall accuracy here because the two classes are equally represented in the test data.
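
The two p-values interpreted above can be checked directly from the confusion matrix with functions from base R's stats package. This is a sketch under the assumption that the reported values come from an exact one-sided binomial test of the accuracy against the NIR and from McNemar's test with continuity correction; the variable names are illustrative.

```r
# Confusion matrix of the logistic model on the balanced data
# (rows = prediction, columns = reference; filled column by column).
cm <- matrix(c(142, 64, 86, 120), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n       <- sum(cm)
correct <- sum(diag(cm))
nir     <- max(colSums(cm)) / n   # 0.5 here, because both classes have 206 cases

# One-sided exact binomial test: is the accuracy greater than the NIR?
acc_test <- binom.test(correct, n, p = nir, alternative = "greater")

# McNemar's test compares the two off-diagonal cells (86 versus 64),
# i.e. whether one type of error is made more often than the other.
mcn_test <- mcnemar.test(cm)

acc_test$p.value
mcn_test$p.value
```

A small binomial p-value supports the claim that the model beats the no-information rate, while a McNemar p-value above 0.05 means the imbalance between the two error types could plausibly be due to chance.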

5.2.2 Decision Tree model on balanced dataset (Upsampling)

## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"

## Call:
## rpart(formula = depressed ~ ., data = train_data, method = "class")
##   n= 964 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.15560166      0 1.0000000 1.0788382 0.03210758
## 2  0.10788382      1 0.8443983 0.9107884 0.03207941
## 3  0.03941909      2 0.7365145 0.7842324 0.03144917
## 4  0.01867220      3 0.6970954 0.7406639 0.03110591
## 5  0.01659751      5 0.6597510 0.7385892 0.03108789
## 6  0.01556017      8 0.6058091 0.6867220 0.03058654
## 7  0.01452282     15 0.4854772 0.6867220 0.03058654
## 8  0.01244813     16 0.4709544 0.6473029 0.03013808
## 9  0.01175657     17 0.4585062 0.6307054 0.02993114
## 10 0.01000000     20 0.4232365 0.6078838 0.02962849
## 
## Variable importance
##           cons_social       cons_nondurable            cons_other 
##                    10                     8                     7 
##         asset_durable          cons_allfood     cons_med_children 
##                     6                     6                     6 
##    durable_investment     fs_chwholed_often    med_sickdays_hhave 
##                     5                     5                     5 
##               fs_meat              children           asset_phone 
##                     5                     4                     4 
##     fs_adwholed_often        household_size        ent_total_cost 
##                     4                     4                     3 
##        marital_status                   age nondurable_investment 
##                     3                     3                     3 
##           fs_sleephun          years_of_edu          cons_alcohol 
##                     2                     2                     2 
##        cons_med_total          cons_tobacco         asset_savings 
##                     1                     1                     1 
## 
## Node number 1: 964 observations,    complexity param=0.1556017
##   predicted class=0  expected loss=0.5  P(node) =1
##     class counts:   482   482
##    probabilities: 0.500 0.500 
##   left son=2 (755 obs) right son=3 (209 obs)
##   Primary splits:
##       fs_adwholed_often  < 2          to the left,  improve=17.18210, (0 missing)
##       med_sickdays_hhave < 1.469298   to the left,  improve=15.26727, (0 missing)
##       cons_social        < 3.22643    to the right, improve=14.05538, (0 missing)
##       durable_investment < 282.9833   to the right, improve=12.82277, (0 missing)
##       years_of_edu       < 6.5        to the right, improve=12.75755, (0 missing)
##   Surrogate splits:
##       fs_sleephun       < 0.5        to the left,  agree=0.859, adj=0.349, (0 split)
##       fs_chwholed_often < 0.9384058  to the left,  agree=0.856, adj=0.335, (0 split)
##       cons_social       < 20.57975   to the left,  agree=0.793, adj=0.043, (0 split)
##       cons_med_total    < 29.62843   to the left,  agree=0.789, adj=0.029, (0 split)
##       asset_savings     < 94.49069   to the left,  agree=0.785, adj=0.010, (0 split)
## 
## Node number 2: 755 observations,    complexity param=0.1078838
##   predicted class=0  expected loss=0.4503311  P(node) =0.783195
##     class counts:   415   340
##    probabilities: 0.550 0.450 
##   left son=4 (357 obs) right son=5 (398 obs)
##   Primary splits:
##       cons_social           < 0.08007685 to the right, improve=22.26146, (0 missing)
##       fs_meat               < 3.03444    to the left,  improve=21.27456, (0 missing)
##       nondurable_investment < 0.3447753  to the right, improve=17.67330, (0 missing)
##       ent_total_cost        < 0.07785249 to the right, improve=16.71426, (0 missing)
##       cons_nondurable       < 13.62031   to the right, improve=16.65899, (0 missing)
##   Surrogate splits:
##       asset_durable      < 12.0916    to the right, agree=0.959, adj=0.913, (0 split)
##       durable_investment < 20.90006   to the right, agree=0.958, adj=0.910, (0 split)
##       cons_nondurable    < 13.62031   to the right, agree=0.956, adj=0.908, (0 split)
##       cons_allfood       < 2.616797   to the right, agree=0.956, adj=0.908, (0 split)
##       cons_other         < 0.4804611  to the right, agree=0.955, adj=0.905, (0 split)
## 
## Node number 3: 209 observations,    complexity param=0.01659751
##   predicted class=1  expected loss=0.3205742  P(node) =0.216805
##     class counts:    67   142
##    probabilities: 0.321 0.679 
##   left son=6 (108 obs) right son=7 (101 obs)
##   Primary splits:
##       med_sickdays_hhave < 1.775      to the left,  improve=10.279040, (0 missing)
##       durable_investment < 283.7841   to the right, improve= 8.556608, (0 missing)
##       cons_other         < 44.05828   to the right, improve= 6.367930, (0 missing)
##       ent_total_cost     < 0.6517366  to the left,  improve= 5.469878, (0 missing)
##       years_of_edu       < 8.5        to the right, improve= 4.869325, (0 missing)
##   Surrogate splits:
##       cons_social           < 2.462363   to the left,  agree=0.646, adj=0.267, (0 split)
##       cons_med_total        < 14.01345   to the left,  agree=0.622, adj=0.218, (0 split)
##       cons_med_children     < 0.8808453  to the left,  agree=0.622, adj=0.218, (0 split)
##       marital_status        < 0.5        to the right, agree=0.617, adj=0.208, (0 split)
##       nondurable_investment < 40.25419   to the left,  agree=0.612, adj=0.198, (0 split)
## 
## Node number 4: 357 observations,    complexity param=0.03941909
##   predicted class=0  expected loss=0.3221289  P(node) =0.370332
##     class counts:   242   115
##    probabilities: 0.678 0.322 
##   left son=8 (284 obs) right son=9 (73 obs)
##   Primary splits:
##       asset_phone            < 63.26071   to the left,  improve=17.411140, (0 missing)
##       fs_meat                < 4.5        to the left,  improve=10.299610, (0 missing)
##       asset_land_owned_total < 1.3        to the right, improve= 8.888106, (0 missing)
##       asset_durable          < 523.8868   to the left,  improve= 7.820506, (0 missing)
##       cons_med_total         < 12.33183   to the left,  improve= 6.885786, (0 missing)
##   Surrogate splits:
##       cons_med_children     < 10.57014   to the left,  agree=0.821, adj=0.123, (0 split)
##       asset_durable         < 584.561    to the left,  agree=0.812, adj=0.082, (0 split)
##       nondurable_investment < 0.3447753  to the right, agree=0.812, adj=0.082, (0 split)
##       cons_nondurable       < 423.3387   to the left,  agree=0.810, adj=0.068, (0 split)
##       cons_other            < 77.35424   to the left,  agree=0.810, adj=0.068, (0 split)
## 
## Node number 5: 398 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4346734  P(node) =0.4128631
##     class counts:   173   225
##    probabilities: 0.435 0.565 
##   left son=10 (312 obs) right son=11 (86 obs)
##   Primary splits:
##       fs_chwholed_often  < 0.112069   to the right, improve=8.963033, (0 missing)
##       children           < 0.5        to the right, improve=7.398128, (0 missing)
##       cons_med_children  < 0.6780793  to the right, improve=7.359677, (0 missing)
##       marital_status     < 0.5        to the right, improve=7.297268, (0 missing)
##       med_sickdays_hhave < 3.75       to the left,  improve=5.508631, (0 missing)
##   Surrogate splits:
##       cons_med_children < 0.1501441  to the right, agree=0.987, adj=0.942, (0 split)
##       children          < 0.5        to the right, agree=0.925, adj=0.651, (0 split)
##       household_size    < 2.5        to the right, agree=0.894, adj=0.512, (0 split)
##       cons_nondurable   < 14.34405   to the left,  agree=0.862, adj=0.360, (0 split)
##       asset_durable     < 2.882766   to the left,  agree=0.862, adj=0.360, (0 split)
## 
## Node number 6: 108 observations,    complexity param=0.01659751
##   predicted class=1  expected loss=0.4722222  P(node) =0.1120332
##     class counts:    51    57
##    probabilities: 0.472 0.528 
##   left son=12 (16 obs) right son=13 (92 obs)
##   Primary splits:
##       cons_social    < 4.170669   to the right, improve=10.463770, (0 missing)
##       marital_status < 0.5        to the left,  improve= 8.233333, (0 missing)
##       ent_total_cost < 0.8713918  to the left,  improve= 7.724959, (0 missing)
##       fs_meat        < 3.5        to the right, improve= 6.699595, (0 missing)
##       asset_durable  < 61.01856   to the right, improve= 6.519048, (0 missing)
##   Surrogate splits:
##       cons_med_children  < 2.001921   to the right, agree=0.880, adj=0.188, (0 split)
##       cons_other         < 44.37859   to the right, agree=0.880, adj=0.188, (0 split)
##       med_sickdays_hhave < 1.669643   to the right, agree=0.880, adj=0.188, (0 split)
##       cons_med_total     < 2.322229   to the right, agree=0.870, adj=0.125, (0 split)
##       years_of_edu       < 11         to the right, agree=0.861, adj=0.063, (0 split)
## 
## Node number 7: 101 observations
##   predicted class=1  expected loss=0.1584158  P(node) =0.1047718
##     class counts:    16    85
##    probabilities: 0.158 0.842 
## 
## Node number 8: 284 observations,    complexity param=0.01175657
##   predicted class=0  expected loss=0.2429577  P(node) =0.2946058
##     class counts:   215    69
##    probabilities: 0.757 0.243 
##   left son=16 (117 obs) right son=17 (167 obs)
##   Primary splits:
##       ent_total_cost        < 3.359891   to the left,  improve=8.827632, (0 missing)
##       nondurable_investment < 3.705779   to the left,  improve=8.730911, (0 missing)
##       ent_nonag_revenue     < 333.1197   to the left,  improve=5.414585, (0 missing)
##       ed_expenses           < 15.61499   to the left,  improve=5.241596, (0 missing)
##       years_of_edu          < 5.5        to the right, improve=5.046073, (0 missing)
##   Surrogate splits:
##       nondurable_investment < 7.223599   to the left,  agree=0.866, adj=0.675, (0 split)
##       cons_nondurable       < 129.3255   to the left,  agree=0.683, adj=0.231, (0 split)
##       cons_other            < 13.70115   to the left,  agree=0.662, adj=0.179, (0 split)
##       durable_investment    < 140.7668   to the left,  agree=0.651, adj=0.154, (0 split)
##       cons_allfood          < 85.47331   to the left,  agree=0.648, adj=0.145, (0 split)
## 
## Node number 9: 73 observations,    complexity param=0.0186722
##   predicted class=1  expected loss=0.369863  P(node) =0.07572614
##     class counts:    27    46
##    probabilities: 0.370 0.630 
##   left son=18 (29 obs) right son=19 (44 obs)
##   Primary splits:
##       fs_meat      < 3.5        to the left,  improve=7.833040, (0 missing)
##       cons_social  < 3.269805   to the right, improve=6.299506, (0 missing)
##       fs_sleephun  < 0.5        to the right, improve=4.150348, (0 missing)
##       age          < 29.5       to the right, improve=4.109722, (0 missing)
##       years_of_edu < 9.5        to the left,  improve=3.661097, (0 missing)
##   Surrogate splits:
##       cons_other      < 24.34336   to the left,  agree=0.836, adj=0.586, (0 split)
##       fs_sleephun     < 0.5        to the right, agree=0.767, adj=0.414, (0 split)
##       cons_allfood    < 137.5272   to the left,  agree=0.753, adj=0.379, (0 split)
##       cons_nondurable < 92.18123   to the left,  agree=0.740, adj=0.345, (0 split)
##       asset_savings   < 3.203074   to the left,  agree=0.726, adj=0.310, (0 split)
## 
## Node number 10: 312 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4903846  P(node) =0.3236515
##     class counts:   153   159
##    probabilities: 0.490 0.510 
##   left son=20 (10 obs) right son=21 (302 obs)
##   Primary splits:
##       household_size    < 2.5        to the left,  improve=5.366149, (0 missing)
##       children          < 1.5        to the left,  improve=4.720085, (0 missing)
##       age               < 22.5       to the left,  improve=3.436915, (0 missing)
##       marital_status    < 0.5        to the right, improve=2.763880, (0 missing)
##       cons_med_children < 1.525836   to the right, improve=2.550214, (0 missing)
## 
## Node number 11: 86 observations
##   predicted class=1  expected loss=0.2325581  P(node) =0.08921162
##     class counts:    20    66
##    probabilities: 0.233 0.767 
## 
## Node number 12: 16 observations
##   predicted class=0  expected loss=0  P(node) =0.01659751
##     class counts:    16     0
##    probabilities: 1.000 0.000 
## 
## Node number 13: 92 observations,    complexity param=0.01659751
##   predicted class=1  expected loss=0.3804348  P(node) =0.09543568
##     class counts:    35    57
##    probabilities: 0.380 0.620 
##   left son=26 (10 obs) right son=27 (82 obs)
##   Primary splits:
##       marital_status        < 0.5        to the left,  improve=8.613468, (0 missing)
##       cons_allfood          < 65.76883   to the left,  improve=7.987229, (0 missing)
##       ent_total_cost        < 0.8713918  to the left,  improve=6.309825, (0 missing)
##       age                   < 50         to the right, improve=5.816624, (0 missing)
##       nondurable_investment < 1.782266   to the left,  improve=5.816624, (0 missing)
##   Surrogate splits:
##       age                    < 53         to the right, agree=0.935, adj=0.4, (0 split)
##       asset_land_owned_total < 2.495      to the right, agree=0.913, adj=0.2, (0 split)
## 
## Node number 16: 117 observations
##   predicted class=0  expected loss=0.09401709  P(node) =0.1213693
##     class counts:   106    11
##    probabilities: 0.906 0.094 
## 
## Node number 17: 167 observations,    complexity param=0.01175657
##   predicted class=0  expected loss=0.3473054  P(node) =0.1732365
##     class counts:   109    58
##    probabilities: 0.653 0.347 
##   left son=34 (125 obs) right son=35 (42 obs)
##   Primary splits:
##       years_of_edu       < 7.5        to the right, improve=6.898480, (0 missing)
##       cons_nondurable    < 53.34834   to the right, improve=5.580367, (0 missing)
##       durable_investment < 191.5102   to the right, improve=4.503051, (0 missing)
##       ent_total_cost     < 16.04206   to the right, improve=4.358415, (0 missing)
##       asset_phone        < 41.63996   to the right, improve=4.268866, (0 missing)
##   Surrogate splits:
##       cons_nondurable    < 66.13776   to the right, agree=0.802, adj=0.214, (0 split)
##       cons_allfood       < 38.47273   to the right, agree=0.802, adj=0.214, (0 split)
##       age                < 42         to the left,  agree=0.796, adj=0.190, (0 split)
##       cons_tobacco       < 1.014088   to the left,  agree=0.790, adj=0.167, (0 split)
##       durable_investment < 100.6285   to the right, agree=0.784, adj=0.143, (0 split)
## 
## Node number 18: 29 observations,    complexity param=0.0186722
##   predicted class=0  expected loss=0.3448276  P(node) =0.03008299
##     class counts:    19    10
##    probabilities: 0.655 0.345 
##   left son=36 (18 obs) right son=37 (11 obs)
##   Primary splits:
##       fs_meat       < 1.5        to the right, improve=11.285270, (0 missing)
##       cons_social   < 2.602498   to the right, improve= 6.436782, (0 missing)
##       cons_other    < 20.54772   to the right, improve= 5.603448, (0 missing)
##       ed_expenses   < 20.65983   to the left,  improve= 4.214559, (0 missing)
##       asset_durable < 251.7616   to the right, improve= 3.629764, (0 missing)
##   Surrogate splits:
##       cons_social        < 2.602498   to the right, agree=0.862, adj=0.636, (0 split)
##       cons_nondurable    < 84.25133   to the right, agree=0.759, adj=0.364, (0 split)
##       asset_durable      < 153.0269   to the right, agree=0.759, adj=0.364, (0 split)
##       cons_other         < 20.54772   to the right, agree=0.759, adj=0.364, (0 split)
##       med_sickdays_hhave < 3.925      to the left,  agree=0.759, adj=0.364, (0 split)
## 
## Node number 19: 44 observations
##   predicted class=1  expected loss=0.1818182  P(node) =0.04564315
##     class counts:     8    36
##    probabilities: 0.182 0.818 
## 
## Node number 20: 10 observations
##   predicted class=0  expected loss=0  P(node) =0.01037344
##     class counts:    10     0
##    probabilities: 1.000 0.000 
## 
## Node number 21: 302 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4735099  P(node) =0.313278
##     class counts:   143   159
##    probabilities: 0.474 0.526 
##   left son=42 (244 obs) right son=43 (58 obs)
##   Primary splits:
##       marital_status    < 0.5        to the right, improve=4.672824, (0 missing)
##       age               < 22.5       to the left,  improve=2.490137, (0 missing)
##       children          < 6.5        to the left,  improve=2.361525, (0 missing)
##       fs_chwholed_often < 0.4395492  to the right, improve=1.611502, (0 missing)
##       years_of_edu      < 7.5        to the right, improve=1.274011, (0 missing)
##   Surrogate splits:
##       years_of_edu < 5.5        to the right, agree=0.821, adj=0.069, (0 split)
## 
## Node number 26: 10 observations
##   predicted class=0  expected loss=0  P(node) =0.01037344
##     class counts:    10     0
##    probabilities: 1.000 0.000 
## 
## Node number 27: 82 observations,    complexity param=0.01452282
##   predicted class=1  expected loss=0.304878  P(node) =0.08506224
##     class counts:    25    57
##    probabilities: 0.305 0.695 
##   left son=54 (13 obs) right son=55 (69 obs)
##   Primary splits:
##       ent_total_cost  < 0.8713918  to the left,  improve=6.662452, (0 missing)
##       cons_allfood    < 65.76883   to the left,  improve=5.572647, (0 missing)
##       cons_social     < 1.000961   to the right, improve=4.943337, (0 missing)
##       cons_nondurable < 73.02399   to the left,  improve=3.715162, (0 missing)
##       asset_durable   < 61.01856   to the right, improve=3.637501, (0 missing)
##   Surrogate splits:
##       nondurable_investment < 2.008594   to the left,  agree=0.927, adj=0.538, (0 split)
##       cons_nondurable       < 41.84797   to the left,  agree=0.890, adj=0.308, (0 split)
##       fs_meat               < 0.5        to the left,  agree=0.878, adj=0.231, (0 split)
## 
## Node number 34: 125 observations
##   predicted class=0  expected loss=0.264  P(node) =0.129668
##     class counts:    92    33
##    probabilities: 0.736 0.264 
## 
## Node number 35: 42 observations,    complexity param=0.01175657
##   predicted class=1  expected loss=0.4047619  P(node) =0.04356846
##     class counts:    17    25
##    probabilities: 0.405 0.595 
##   left son=70 (9 obs) right son=71 (33 obs)
##   Primary splits:
##       cons_alcohol   < 0.587257   to the right, improve=8.116883, (0 missing)
##       ent_total_cost < 10.20757   to the right, improve=5.418873, (0 missing)
##       cons_ed        < 2.335575   to the left,  improve=4.132326, (0 missing)
##       cons_tobacco   < 0.6978126  to the right, improve=4.004762, (0 missing)
##       ed_expenses    < 31.87059   to the left,  improve=3.569903, (0 missing)
##   Surrogate splits:
##       cons_tobacco       < 0.6978126  to the right, agree=0.833, adj=0.222, (0 split)
##       cons_ed            < 0.6673071  to the left,  agree=0.833, adj=0.222, (0 split)
##       med_sickdays_hhave < 4.45       to the right, agree=0.833, adj=0.222, (0 split)
##       ed_expenses        < 8.007685   to the left,  agree=0.833, adj=0.222, (0 split)
##       durable_investment < 84.29773   to the left,  agree=0.833, adj=0.222, (0 split)
## 
## Node number 36: 18 observations
##   predicted class=0  expected loss=0  P(node) =0.0186722
##     class counts:    18     0
##    probabilities: 1.000 0.000 
## 
## Node number 37: 11 observations
##   predicted class=1  expected loss=0.09090909  P(node) =0.01141079
##     class counts:     1    10
##    probabilities: 0.091 0.909 
## 
## Node number 42: 244 observations,    complexity param=0.01556017
##   predicted class=0  expected loss=0.4836066  P(node) =0.253112
##     class counts:   126   118
##    probabilities: 0.516 0.484 
##   left son=84 (212 obs) right son=85 (32 obs)
##   Primary splits:
##       children          < 5.5        to the left,  improve=3.062249, (0 missing)
##       age               < 24.5       to the left,  improve=2.760817, (0 missing)
##       fs_chwholed_often < 0.4395492  to the right, improve=2.287265, (0 missing)
##       years_of_edu      < 4.5        to the left,  improve=2.127327, (0 missing)
##       cons_med_children < 2.100913   to the right, improve=1.663826, (0 missing)
##   Surrogate splits:
##       cons_med_children  < 1.136585   to the right, agree=0.963, adj=0.719, (0 split)
##       household_size     < 7.5        to the left,  agree=0.947, adj=0.594, (0 split)
##       med_sickdays_hhave < 1.295687   to the right, agree=0.947, adj=0.594, (0 split)
##       fs_chwholed_often  < 0.5664414  to the left,  agree=0.881, adj=0.094, (0 split)
## 
## Node number 43: 58 observations
##   predicted class=1  expected loss=0.2931034  P(node) =0.06016598
##     class counts:    17    41
##    probabilities: 0.293 0.707 
## 
## Node number 54: 13 observations
##   predicted class=0  expected loss=0.2307692  P(node) =0.01348548
##     class counts:    10     3
##    probabilities: 0.769 0.231 
## 
## Node number 55: 69 observations
##   predicted class=1  expected loss=0.2173913  P(node) =0.07157676
##     class counts:    15    54
##    probabilities: 0.217 0.783 
## 
## Node number 70: 9 observations
##   predicted class=0  expected loss=0  P(node) =0.0093361
##     class counts:     9     0
##    probabilities: 1.000 0.000 
## 
## Node number 71: 33 observations
##   predicted class=1  expected loss=0.2424242  P(node) =0.03423237
##     class counts:     8    25
##    probabilities: 0.242 0.758 
## 
## Node number 84: 212 observations,    complexity param=0.01556017
##   predicted class=0  expected loss=0.4528302  P(node) =0.219917
##     class counts:   116    96
##    probabilities: 0.547 0.453 
##   left son=168 (56 obs) right son=169 (156 obs)
##   Primary splits:
##       household_size     < 5.5        to the right, improve=3.390853, (0 missing)
##       fs_chwholed_often  < 0.4288448  to the right, improve=3.227228, (0 missing)
##       years_of_edu       < 13.5       to the right, improve=2.968799, (0 missing)
##       med_sickdays_hhave < 1.525668   to the left,  improve=2.657614, (0 missing)
##       age                < 31.5       to the right, improve=2.086907, (0 missing)
##   Surrogate splits:
##       children           < 3.5        to the right, agree=0.934, adj=0.750, (0 split)
##       med_sickdays_hhave < 1.525668   to the left,  agree=0.906, adj=0.643, (0 split)
##       fs_chwholed_often  < 0.4288448  to the right, agree=0.877, adj=0.536, (0 split)
##       cons_med_children  < 1.238365   to the left,  agree=0.858, adj=0.464, (0 split)
##       age                < 38.5       to the right, agree=0.811, adj=0.286, (0 split)
## 
## Node number 85: 32 observations
##   predicted class=1  expected loss=0.3125  P(node) =0.03319502
##     class counts:    10    22
##    probabilities: 0.312 0.688 
## 
## Node number 168: 56 observations
##   predicted class=0  expected loss=0.3035714  P(node) =0.05809129
##     class counts:    39    17
##    probabilities: 0.696 0.304 
## 
## Node number 169: 156 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4935897  P(node) =0.1618257
##     class counts:    77    79
##    probabilities: 0.494 0.506 
##   left son=338 (69 obs) right son=339 (87 obs)
##   Primary splits:
##       age               < 24.5       to the left,  improve=4.1560950, (0 missing)
##       children          < 1.5        to the left,  improve=3.0010680, (0 missing)
##       cons_med_children < 2.100913   to the right, improve=3.0010680, (0 missing)
##       fs_chwholed_often < 0.526824   to the right, improve=3.0010680, (0 missing)
##       years_of_edu      < 11.5       to the left,  improve=0.8312297, (0 missing)
##   Surrogate splits:
##       household_size     < 4.5        to the left,  agree=0.628, adj=0.159, (0 split)
##       med_sickdays_hhave < 1.6844     to the right, agree=0.628, adj=0.159, (0 split)
##       years_of_edu       < 8.5        to the left,  agree=0.622, adj=0.145, (0 split)
##       children           < 3.5        to the right, agree=0.583, adj=0.058, (0 split)
##       cons_med_children  < 1.238365   to the left,  agree=0.583, adj=0.058, (0 split)
## 
## Node number 338: 69 observations,    complexity param=0.01244813
##   predicted class=0  expected loss=0.3768116  P(node) =0.07157676
##     class counts:    43    26
##    probabilities: 0.623 0.377 
##   left son=676 (55 obs) right son=677 (14 obs)
##   Primary splits:
##       age                < 17.5       to the right, improve=4.0006020, (0 missing)
##       years_of_edu       < 8.5        to the right, improve=2.4492750, (0 missing)
##       med_sickdays_hhave < 1.6844     to the left,  improve=1.7536230, (0 missing)
##       household_size     < 4.5        to the right, improve=1.7536230, (0 missing)
##       fs_chwholed_often  < 0.4288448  to the left,  improve=0.7443551, (0 missing)
##   Surrogate splits:
##       children           < 3.5        to the left,  agree=0.855, adj=0.286, (0 split)
##       cons_med_children  < 1.238365   to the right, agree=0.855, adj=0.286, (0 split)
##       household_size     < 3.5        to the right, agree=0.826, adj=0.143, (0 split)
##       med_sickdays_hhave < 2.125778   to the left,  agree=0.826, adj=0.143, (0 split)
## 
## Node number 339: 87 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.3908046  P(node) =0.09024896
##     class counts:    34    53
##    probabilities: 0.391 0.609 
##   left son=678 (9 obs) right son=679 (78 obs)
##   Primary splits:
##       fs_chwholed_often < 0.4288448  to the right, improve=7.450928, (0 missing)
##       children          < 1.5        to the left,  improve=5.650287, (0 missing)
##       cons_med_children < 2.100913   to the right, improve=5.650287, (0 missing)
##       age               < 32         to the right, improve=3.321839, (0 missing)
##       years_of_edu      < 12.5       to the right, improve=1.088692, (0 missing)
##   Surrogate splits:
##       children          < 1.5        to the left,  agree=0.977, adj=0.778, (0 split)
##       cons_med_children < 2.100913   to the right, agree=0.977, adj=0.778, (0 split)
## 
## Node number 676: 55 observations
##   predicted class=0  expected loss=0.2909091  P(node) =0.05705394
##     class counts:    39    16
##    probabilities: 0.709 0.291 
## 
## Node number 677: 14 observations
##   predicted class=1  expected loss=0.2857143  P(node) =0.01452282
##     class counts:     4    10
##    probabilities: 0.286 0.714 
## 
## Node number 678: 9 observations
##   predicted class=0  expected loss=0  P(node) =0.0093361
##     class counts:     9     0
##    probabilities: 1.000 0.000 
## 
## Node number 679: 78 observations
##   predicted class=1  expected loss=0.3205128  P(node) =0.08091286
##     class counts:    25    53
##    probabilities: 0.321 0.679 
## 
## n= 964 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 964 482 0 (0.50000000 0.50000000)  
##     2) fs_adwholed_often< 2 755 340 0 (0.54966887 0.45033113)  
##       4) cons_social>=0.08007685 357 115 0 (0.67787115 0.32212885)  
##         8) asset_phone< 63.26071 284  69 0 (0.75704225 0.24295775)  
##          16) ent_total_cost< 3.359891 117  11 0 (0.90598291 0.09401709) *
##          17) ent_total_cost>=3.359891 167  58 0 (0.65269461 0.34730539)  
##            34) years_of_edu>=7.5 125  33 0 (0.73600000 0.26400000) *
##            35) years_of_edu< 7.5 42  17 1 (0.40476190 0.59523810)  
##              70) cons_alcohol>=0.587257 9   0 0 (1.00000000 0.00000000) *
##              71) cons_alcohol< 0.587257 33   8 1 (0.24242424 0.75757576) *
##         9) asset_phone>=63.26071 73  27 1 (0.36986301 0.63013699)  
##          18) fs_meat< 3.5 29  10 0 (0.65517241 0.34482759)  
##            36) fs_meat>=1.5 18   0 0 (1.00000000 0.00000000) *
##            37) fs_meat< 1.5 11   1 1 (0.09090909 0.90909091) *
##          19) fs_meat>=3.5 44   8 1 (0.18181818 0.81818182) *
##       5) cons_social< 0.08007685 398 173 1 (0.43467337 0.56532663)  
##        10) fs_chwholed_often>=0.112069 312 153 1 (0.49038462 0.50961538)  
##          20) household_size< 2.5 10   0 0 (1.00000000 0.00000000) *
##          21) household_size>=2.5 302 143 1 (0.47350993 0.52649007)  
##            42) marital_status>=0.5 244 118 0 (0.51639344 0.48360656)  
##              84) children< 5.5 212  96 0 (0.54716981 0.45283019)  
##               168) household_size>=5.5 56  17 0 (0.69642857 0.30357143) *
##               169) household_size< 5.5 156  77 1 (0.49358974 0.50641026)  
##                 338) age< 24.5 69  26 0 (0.62318841 0.37681159)  
##                   676) age>=17.5 55  16 0 (0.70909091 0.29090909) *
##                   677) age< 17.5 14   4 1 (0.28571429 0.71428571) *
##                 339) age>=24.5 87  34 1 (0.39080460 0.60919540)  
##                   678) fs_chwholed_often>=0.4288448 9   0 0 (1.00000000 0.00000000) *
##                   679) fs_chwholed_often< 0.4288448 78  25 1 (0.32051282 0.67948718) *
##              85) children>=5.5 32  10 1 (0.31250000 0.68750000) *
##            43) marital_status< 0.5 58  17 1 (0.29310345 0.70689655) *
##        11) fs_chwholed_often< 0.112069 86  20 1 (0.23255814 0.76744186) *
##     3) fs_adwholed_often>=2 209  67 1 (0.32057416 0.67942584)  
##       6) med_sickdays_hhave< 1.775 108  51 1 (0.47222222 0.52777778)  
##        12) cons_social>=4.170669 16   0 0 (1.00000000 0.00000000) *
##        13) cons_social< 4.170669 92  35 1 (0.38043478 0.61956522)  
##          26) marital_status< 0.5 10   0 0 (1.00000000 0.00000000) *
##          27) marital_status>=0.5 82  25 1 (0.30487805 0.69512195)  
##            54) ent_total_cost< 0.8713918 13   3 0 (0.76923077 0.23076923) *
##            55) ent_total_cost>=0.8713918 69  15 1 (0.21739130 0.78260870) *
##       7) med_sickdays_hhave>=1.775 101  16 1 (0.15841584 0.84158416) *
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 138  52
##          1  68 154
##                                           
##                Accuracy : 0.7087          
##                  95% CI : (0.6623, 0.7522)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4175          
##                                           
##  Mcnemar's Test P-Value : 0.1709          
##                                           
##             Sensitivity : 0.6699          
##             Specificity : 0.7476          
##          Pos Pred Value : 0.7263          
##          Neg Pred Value : 0.6937          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3350          
##    Detection Prevalence : 0.4612          
##       Balanced Accuracy : 0.7087          
##                                           
##        'Positive' Class : 0               
## 

5.2.2 Summary of Decision Tree model on balanced dataset (Upsampling)

Model performance summary: The model reaches an accuracy of 70.87% on the balanced test set, well above the 50% no-information rate (p < 2e-16), so it separates the two categories (depressed and non-depressed) moderately well. The Kappa statistic of 0.4175 indicates moderate agreement between predicted and actual classes beyond chance.

Kappa: 0.4175. Kappa values between 0.4 and 0.6 indicate moderate agreement between the predicted and actual classes.

McNemar’s Test P-Value: 0.1709, which is higher than 0.05. The two kinds of misclassification (68 non-depressed cases predicted as depressed versus 52 depressed cases predicted as non-depressed) therefore do not differ significantly, and the model’s errors are fairly balanced across the two categories.

Sensitivity: 66.99%. The model correctly identifies approximately 67% of non-depressed instances (class 0 is the 'Positive' class in this output).

Specificity: 74.76%. The model correctly identifies approximately 75% of the depressed instances.

Balanced Accuracy: 70.87%. This is the average of sensitivity and specificity and indicates moderate performance on both categories.
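The statistics listed above follow directly from the four counts in the confusion matrix. A minimal base R sketch is given below; the counts are copied from the output above, while the variable names are chosen here for illustration only.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(138, 68, 52, 154), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n <- sum(cm)

# Share of all test cases on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / n

# Class 0 is the 'Positive' class, so sensitivity refers to class 0.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for the agreement
# expected by chance from the row and column totals.
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity,
        balanced_accuracy = balanced_accuracy,
        kappa = cohen_kappa), 4)
```

Because the test set is balanced, the chance agreement is exactly 0.5, which is why accuracy and balanced accuracy coincide here.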

Variable Importance and Split of the Model

Main splits of the decision tree: The model first splits on “fs_adwholed_often” (Frequency of purchasing full-price food items on a regular basis), which suggests that household food status is an important factor influencing depressive status. At the next level, social and health factors, “cons_social” and “med_sickdays_hhave”, are used as decision nodes.
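The root split and the variable importance scores can also be read from a fitted rpart object directly instead of from the printed summary. The sketch below uses the kyphosis example data that ships with rpart, not the depression data, so the variable names and the object name fit are placeholders rather than the project's own code.

```r
library(rpart)

# Example data shipped with rpart (not the depression dataset).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# The first row of fit$frame describes the root node; its 'var' entry is
# the variable used for the first split.
root_var <- as.character(fit$frame$var[1])

# Raw importance scores, one per variable.
imp <- fit$variable.importance

# Rescaled to percentages, similar to the "Variable importance" block
# printed by summary(fit).
imp_pct <- round(100 * imp / sum(imp))

root_var
imp_pct
```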

5.3.1 Logistic model on balanced dataset (down sampling)

## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20 15
##          1 17 22
##                                           
##                Accuracy : 0.5676          
##                  95% CI : (0.4472, 0.6823)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.1477          
##                                           
##                   Kappa : 0.1351          
##                                           
##  Mcnemar's Test P-Value : 0.8597          
##                                           
##             Sensitivity : 0.5405          
##             Specificity : 0.5946          
##          Pos Pred Value : 0.5714          
##          Neg Pred Value : 0.5641          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2703          
##    Detection Prevalence : 0.4730          
##       Balanced Accuracy : 0.5676          
##                                           
##        'Positive' Class : 0               
## 

5.3.1 Summary of Logistic model on balanced dataset (down sampling)

Summarizing the model performance

This logistic regression model performed poorly on the down-sampled balanced dataset, with an accuracy of 56.76%, which is not significantly better than the 50% no-information rate (p = 0.1477). The model is therefore not effective in distinguishing between depressed and non-depressed states. The Kappa statistic of 0.1351 indicates only slight agreement beyond chance.

Accuracy: 56.76%. The overall accuracy of the model is low, indicating its limited discriminatory power.

95% CI (Confidence Interval): (44.72%, 68.23%). The interval is wide and includes 50%, reflecting the small test set (74 observations), so the accuracy estimate is not stable.

No Information Rate (NIR): 50%. This is the accuracy obtained by always predicting the most frequent class; because the two classes are balanced, it is 50%.

Kappa: 0.1351. This value is close to 0, indicating that the predictions agree with the true labels only slightly more than chance would.

McNemar’s Test P-Value: 0.8597, which indicates that the two kinds of misclassification (17 versus 15 cases) do not differ significantly, i.e., the model’s errors are not skewed toward either category.

Sensitivity and Specificity: 54.05% and 59.46%. These two indicators show that the model is weak at recognizing both the positive (non-depressed) and the negative (depressed) category.

Positive Predictive Value (PPV) and Negative Predictive Value (NPV): PPV is 57.14% and NPV is 56.41%, so only slightly more than half of the predictions for each class are correct.

Balanced Accuracy: 56.76%, which is only slightly above the 50% expected by chance across the two classes.
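The confidence interval, the comparison against the no-information rate, and McNemar's test reported for the logistic model can be recomputed in base R from the four counts alone. This is a sketch under the assumption that the printed statistics come from an exact binomial test and from McNemar's test with continuity correction, which are the defaults of binom.test and mcnemar.test; the variable names are illustrative.

```r
# Counts copied from the logistic model's confusion matrix above
# (rows = prediction, columns = reference; filled column by column).
cm <- matrix(c(20, 17, 15, 22), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

correct <- sum(diag(cm))   # 42 correct predictions
total   <- sum(cm)         # 74 test observations

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy.
acc_ci <- binom.test(correct, total)$conf.int

# One-sided exact test: is the accuracy greater than the 50% NIR?
nir_test <- binom.test(correct, total, p = 0.5, alternative = "greater")

# McNemar's test compares the two off-diagonal cells (15 versus 17).
mc_test <- mcnemar.test(cm)

round(c(accuracy  = correct / total,
        ci_lower  = acc_ci[1],
        ci_upper  = acc_ci[2],
        p_vs_nir  = nir_test$p.value,
        p_mcnemar = mc_test$p.value), 4)
```

With only 74 test observations the interval spans more than 20 percentage points, which is why the small difference from 50% cannot be distinguished from chance.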

5.3.2 Decision Tree model on balanced dataset (down sampling)

## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"

## Call:
## rpart(formula = depressed ~ ., data = train_data, method = "class")
##   n= 176 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.23863636      0 1.0000000 1.2045455 0.07378413
## 2 0.05113636      1 0.7613636 0.7954545 0.07378413
## 3 0.03409091      4 0.6022727 0.8863636 0.07488957
## 4 0.02272727      6 0.5340909 0.9204545 0.07513898
## 5 0.01136364      7 0.5113636 0.9090909 0.07506571
## 6 0.01000000      8 0.5000000 0.9090909 0.07506571
## 
## Variable importance
##          years_of_edu    med_sickdays_hhave           ed_expenses 
##                    14                    13                     9 
##               cons_ed           cons_social nondurable_investment 
##                     9                     7                     6 
##                   age        household_size              children 
##                     5                     4                     4 
##       cons_nondurable          cons_allfood           hh_children 
##                     4                     4                     4 
##    durable_investment        ent_total_cost     cons_med_children 
##                     3                     3                     3 
##            cons_other       ed_schoolattend     fs_chwholed_often 
##                     3                     3                     2 
##     fs_adwholed_often          cons_alcohol 
##                     1                     1 
## 
## Node number 1: 176 observations,    complexity param=0.2386364
##   predicted class=0  expected loss=0.5  P(node) =1
##     class counts:    88    88
##    probabilities: 0.500 0.500 
##   left son=2 (141 obs) right son=3 (35 obs)
##   Primary splits:
##       years_of_edu       < 6.5        to the right, improve=7.863830, (0 missing)
##       med_sickdays_hhave < 0.08333333 to the left,  improve=6.731935, (0 missing)
##       ed_expenses        < 51.20914   to the left,  improve=3.927273, (0 missing)
##       cons_allfood       < 186.4439   to the left,  improve=3.880071, (0 missing)
##       cons_social        < 3.269805   to the right, improve=3.705783, (0 missing)
##   Surrogate splits:
##       age                < 46.5       to the left,  agree=0.847, adj=0.229, (0 split)
##       household_size     < 1.5        to the right, agree=0.841, adj=0.200, (0 split)
##       children           < 0.5        to the right, agree=0.830, adj=0.143, (0 split)
##       med_sickdays_hhave < 6.955645   to the left,  agree=0.830, adj=0.143, (0 split)
##       fs_adwholed_often  < 5.25       to the left,  agree=0.812, adj=0.057, (0 split)
## 
## Node number 2: 141 observations,    complexity param=0.05113636
##   predicted class=0  expected loss=0.4255319  P(node) =0.8011364
##     class counts:    81    60
##    probabilities: 0.574 0.426 
##   left son=4 (28 obs) right son=5 (113 obs)
##   Primary splits:
##       med_sickdays_hhave     < 0.08333333 to the left,  improve=5.583452, (0 missing)
##       ed_expenses            < 45.48365   to the left,  improve=5.579527, (0 missing)
##       cons_ed                < 3.790304   to the left,  improve=4.362527, (0 missing)
##       cons_social            < 0.1648248  to the right, improve=3.451186, (0 missing)
##       asset_land_owned_total < 1.25       to the right, improve=2.760732, (0 missing)
##   Surrogate splits:
##       cons_alcohol      < 4.186875   to the right, agree=0.816, adj=0.071, (0 split)
##       cons_med_children < 7.206916   to the right, agree=0.809, adj=0.036, (0 split)
## 
## Node number 3: 35 observations
##   predicted class=1  expected loss=0.2  P(node) =0.1988636
##     class counts:     7    28
##    probabilities: 0.200 0.800 
## 
## Node number 4: 28 observations
##   predicted class=0  expected loss=0.1428571  P(node) =0.1590909
##     class counts:    24     4
##    probabilities: 0.857 0.143 
## 
## Node number 5: 113 observations,    complexity param=0.05113636
##   predicted class=0  expected loss=0.4955752  P(node) =0.6420455
##     class counts:    57    56
##    probabilities: 0.504 0.496 
##   left son=10 (102 obs) right son=11 (11 obs)
##   Primary splits:
##       ed_expenses     < 45.48365   to the left,  improve=4.167589, (0 missing)
##       ed_schoolattend < 0.7321429  to the left,  improve=3.073525, (0 missing)
##       cons_allfood    < 150.3553   to the left,  improve=3.063232, (0 missing)
##       cons_ed         < 3.790304   to the left,  improve=3.063232, (0 missing)
##       cons_nondurable < 155.5233   to the left,  improve=2.758684, (0 missing)
##   Surrogate splits:
##       cons_ed     < 3.790304   to the left,  agree=0.991, adj=0.909, (0 split)
##       cons_social < 10.10303   to the left,  agree=0.929, adj=0.273, (0 split)
## 
## Node number 10: 102 observations,    complexity param=0.05113636
##   predicted class=0  expected loss=0.4509804  P(node) =0.5795455
##     class counts:    56    46
##    probabilities: 0.549 0.451 
##   left son=20 (33 obs) right son=21 (69 obs)
##   Primary splits:
##       cons_social       < 0.7607301  to the right, improve=3.100054, (0 missing)
##       cons_med_children < 2.80633    to the right, improve=1.844910, (0 missing)
##       asset_durable     < 157.351    to the right, improve=1.593137, (0 missing)
##       cons_nondurable   < 55.43596   to the right, improve=1.551191, (0 missing)
##       cons_allfood      < 27.7962    to the right, improve=1.551191, (0 missing)
##   Surrogate splits:
##       cons_nondurable       < 55.43596   to the right, agree=0.902, adj=0.697, (0 split)
##       cons_allfood          < 27.7962    to the right, agree=0.902, adj=0.697, (0 split)
##       ent_total_cost        < 0.03336535 to the right, agree=0.892, adj=0.667, (0 split)
##       durable_investment    < 42.50397   to the right, agree=0.892, adj=0.667, (0 split)
##       nondurable_investment < 0.7240282  to the right, agree=0.892, adj=0.667, (0 split)
## 
## Node number 11: 11 observations
##   predicted class=1  expected loss=0.09090909  P(node) =0.0625
##     class counts:     1    10
##    probabilities: 0.091 0.909 
## 
## Node number 20: 33 observations
##   predicted class=0  expected loss=0.2727273  P(node) =0.1875
##     class counts:    24     9
##    probabilities: 0.727 0.273 
## 
## Node number 21: 69 observations,    complexity param=0.03409091
##   predicted class=1  expected loss=0.4637681  P(node) =0.3920455
##     class counts:    32    37
##    probabilities: 0.464 0.536 
##   left son=42 (61 obs) right son=43 (8 obs)
##   Primary splits:
##       hh_children       < 2.5        to the left,  improve=2.077037, (0 missing)
##       cons_ed           < 0.3002882  to the left,  improve=1.627315, (0 missing)
##       fs_chwholed_often < 0.3434622  to the right, improve=1.620659, (0 missing)
##       cons_med_children < 0.6880889  to the right, improve=1.489211, (0 missing)
##       years_of_edu      < 10.5       to the right, improve=1.196034, (0 missing)
##   Surrogate splits:
##       cons_ed               < 0.3002882  to the left,  agree=0.971, adj=0.750, (0 split)
##       cons_other            < 20.05925   to the left,  agree=0.971, adj=0.750, (0 split)
##       ed_schoolattend       < 0.25       to the left,  agree=0.971, adj=0.750, (0 split)
##       ed_expenses           < 3.603458   to the left,  agree=0.957, adj=0.625, (0 split)
##       nondurable_investment < 0.854153   to the left,  agree=0.957, adj=0.625, (0 split)
## 
## Node number 42: 61 observations,    complexity param=0.03409091
##   predicted class=0  expected loss=0.4918033  P(node) =0.3465909
##     class counts:    31    30
##    probabilities: 0.508 0.492 
##   left son=84 (40 obs) right son=85 (21 obs)
##   Primary splits:
##       age                < 26.5       to the right, improve=1.0370410, (0 missing)
##       household_size     < 4.5        to the right, improve=1.0089660, (0 missing)
##       med_sickdays_hhave < 1.6844     to the left,  improve=0.9841780, (0 missing)
##       years_of_edu       < 10.5       to the right, improve=0.7503067, (0 missing)
##       marital_status     < 0.5        to the right, improve=0.5608942, (0 missing)
##   Surrogate splits:
##       household_size     < 4.5        to the right, agree=0.738, adj=0.238, (0 split)
##       children           < 2.5        to the right, agree=0.721, adj=0.190, (0 split)
##       fs_chwholed_often  < 0.2787356  to the right, agree=0.721, adj=0.190, (0 split)
##       fs_meat            < 1.5        to the right, agree=0.689, adj=0.095, (0 split)
##       med_sickdays_hhave < 2.018315   to the left,  agree=0.689, adj=0.095, (0 split)
## 
## Node number 43: 8 observations
##   predicted class=1  expected loss=0.125  P(node) =0.04545455
##     class counts:     1     7
##    probabilities: 0.125 0.875 
## 
## Node number 84: 40 observations,    complexity param=0.02272727
##   predicted class=0  expected loss=0.425  P(node) =0.2272727
##     class counts:    23    17
##    probabilities: 0.575 0.425 
##   left son=168 (20 obs) right son=169 (20 obs)
##   Primary splits:
##       cons_med_children < 1.249358   to the right, improve=1.2500000, (0 missing)
##       age               < 35.5       to the left,  improve=0.8632832, (0 missing)
##       children          < 3.5        to the left,  improve=0.8632832, (0 missing)
##       fs_chwholed_often < 0.3642956  to the left,  improve=0.2317043, (0 missing)
##       household_size    < 6.5        to the left,  improve=0.1928571, (0 missing)
##   Surrogate splits:
##       children           < 3.5        to the left,  agree=0.825, adj=0.65, (0 split)
##       med_sickdays_hhave < 1.525668   to the right, agree=0.800, adj=0.60, (0 split)
##       fs_chwholed_often  < 0.3642956  to the left,  agree=0.775, adj=0.55, (0 split)
##       household_size     < 5.5        to the left,  agree=0.725, adj=0.45, (0 split)
##       years_of_edu       < 8.5        to the left,  agree=0.625, adj=0.25, (0 split)
## 
## Node number 85: 21 observations
##   predicted class=1  expected loss=0.3809524  P(node) =0.1193182
##     class counts:     8    13
##    probabilities: 0.381 0.619 
## 
## Node number 168: 20 observations
##   predicted class=0  expected loss=0.3  P(node) =0.1136364
##     class counts:    14     6
##    probabilities: 0.700 0.300 
## 
## Node number 169: 20 observations,    complexity param=0.01136364
##   predicted class=1  expected loss=0.45  P(node) =0.1136364
##     class counts:     9    11
##    probabilities: 0.450 0.550 
##   left son=338 (7 obs) right son=339 (13 obs)
##   Primary splits:
##       age               < 42.5       to the right, improve=0.3175824, (0 missing)
##       children          < 5          to the left,  improve=0.1500000, (0 missing)
##       household_size    < 6.5        to the left,  improve=0.1500000, (0 missing)
##       years_of_edu      < 9.5        to the left,  improve=0.1000000, (0 missing)
##       cons_med_children < 1.136585   to the left,  improve=0.1000000, (0 missing)
##   Surrogate splits:
##       children          < 3          to the left,  agree=0.80, adj=0.429, (0 split)
##       cons_med_children < 0.1501441  to the left,  agree=0.80, adj=0.429, (0 split)
##       fs_chwholed_often < 0.1666667  to the left,  agree=0.80, adj=0.429, (0 split)
##       cons_nondurable   < 49.48167   to the right, agree=0.75, adj=0.286, (0 split)
##       asset_durable     < 69.1864    to the right, agree=0.75, adj=0.286, (0 split)
## 
## Node number 338: 7 observations
##   predicted class=0  expected loss=0.4285714  P(node) =0.03977273
##     class counts:     4     3
##    probabilities: 0.571 0.429 
## 
## Node number 339: 13 observations
##   predicted class=1  expected loss=0.3846154  P(node) =0.07386364
##     class counts:     5     8
##    probabilities: 0.385 0.615 
## 
## n= 176 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 176 88 0 (0.50000000 0.50000000)  
##     2) years_of_edu>=6.5 141 60 0 (0.57446809 0.42553191)  
##       4) med_sickdays_hhave< 0.08333333 28  4 0 (0.85714286 0.14285714) *
##       5) med_sickdays_hhave>=0.08333333 113 56 0 (0.50442478 0.49557522)  
##        10) ed_expenses< 45.48365 102 46 0 (0.54901961 0.45098039)  
##          20) cons_social>=0.7607301 33  9 0 (0.72727273 0.27272727) *
##          21) cons_social< 0.7607301 69 32 1 (0.46376812 0.53623188)  
##            42) hh_children< 2.5 61 30 0 (0.50819672 0.49180328)  
##              84) age>=26.5 40 17 0 (0.57500000 0.42500000)  
##               168) cons_med_children>=1.249358 20  6 0 (0.70000000 0.30000000) *
##               169) cons_med_children< 1.249358 20  9 1 (0.45000000 0.55000000)  
##                 338) age>=42.5 7  3 0 (0.57142857 0.42857143) *
##                 339) age< 42.5 13  5 1 (0.38461538 0.61538462) *
##              85) age< 26.5 21  8 1 (0.38095238 0.61904762) *
##            43) hh_children>=2.5 8  1 1 (0.12500000 0.87500000) *
##        11) ed_expenses>=45.48365 11  1 1 (0.09090909 0.90909091) *
##     3) years_of_edu< 6.5 35  7 1 (0.20000000 0.80000000) *
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19 15
##          1 18 22
##                                           
##                Accuracy : 0.5541          
##                  95% CI : (0.4339, 0.6698)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.2080          
##                                           
##                   Kappa : 0.1081          
##                                           
##  Mcnemar's Test P-Value : 0.7277          
##                                           
##             Sensitivity : 0.5135          
##             Specificity : 0.5946          
##          Pos Pred Value : 0.5588          
##          Neg Pred Value : 0.5500          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2568          
##    Detection Prevalence : 0.4595          
##       Balanced Accuracy : 0.5541          
##                                           
##        'Positive' Class : 0               
## 

5.3.2 Summary of Decision Tree model on balanced dataset (down sampling)

Accuracy: 55.41%, which is not significantly above the 50% no-information rate (p = 0.2080).

Kappa: 0.1081, indicating that the model’s predictive ability is poor.

Root node: The first split is on years_of_edu (threshold 6.5 years), which suggests that this variable is an important factor in distinguishing between the two categories (depressed or not).

Second-level split: Among respondents with years_of_edu >= 6.5, the tree splits next on med_sickdays_hhave, which suggests that health status is also an important factor affecting depression status.

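Deeper nodes: Variables such as ed_expenses and cons_social are used in deeper nodes, which shows that the model combines several different features to refine its classifications. The cross-validated error (xerror) in the CP table is lowest after a single split (0.795) and rises again with further splits, so the deeper splits may not generalize well.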
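The general workflow behind this section (down-sample to balanced classes, fit a classification tree, and tabulate predictions against the reference) can be sketched as follows. The data are simulated and the predictors x1 and x2 are made up; the project's own balancing and splitting code is not shown in this report and may differ, for example by using a helper function from a package instead of the base R sampling used here.

```r
library(rpart)

set.seed(515)

# Simulated data (hypothetical): x1 and x2 are made-up predictors and
# 'depressed' is a 0/1 factor in which class 1 is the minority.
n <- 600
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$depressed <- factor(rbinom(n, 1, plogis(-1.6 + 0.8 * dat$x1)))

# Down-sample: draw the same number of rows (the minority count) from each class.
n_min <- min(table(dat$depressed))
rows_by_class <- split(seq_len(nrow(dat)), dat$depressed)
keep <- unlist(lapply(rows_by_class,
                      function(idx) idx[sample.int(length(idx), n_min)]))
balanced <- dat[keep, ]

# Hold out 30% of the balanced rows for testing.
test_idx   <- sample.int(nrow(balanced), size = round(0.3 * nrow(balanced)))
train_data <- balanced[-test_idx, ]
test_data  <- balanced[test_idx, ]

# Classification tree, using the same call as in the output above.
fit  <- rpart(depressed ~ ., data = train_data, method = "class")
pred <- predict(fit, newdata = test_data, type = "class")

# Confusion matrix: rows = prediction, columns = reference.
cm <- table(Prediction = pred, Reference = test_data$depressed)
cm
```

Down-sampling discards majority-class rows, so the tree is trained on far fewer observations than are available. That is one plausible reason for the weaker results in this section compared with the up-sampled models.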